STATS762 Regression for Data Science
Assignment 3
Due date: 10am, 1 June 2020
Instructions
• Please submit both your R Markdown document and the PDF file it
generates. To create a PDF, start your R Markdown document with the
following lines (after making the appropriate changes):
---
title: "STATS 762 Assignment 3"
author: "Your Name, ID 1234567"
date: "Due: 10am, 1 June 2020"
output: pdf_document
---
• Call set.seed() at the start of your R code so that the same output is
obtained when the document is re-run.
• All answers should be written with corresponding question numbers.
• Working must be shown.
• Each answer should be stated explicitly; R code on its own does not
constitute an answer.
For example, suppose the question asks for the average height of 6 trees: (1, 2, 1,
3, 1.5).
[Table with two columns, "Good answer" and "Bad answer"; cell contents missing]
• If any of the above is not satisfied, a penalty may be applied.
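The reproducibility instruction above can be illustrated with a minimal sketch. The seed value 762 and the variable names are arbitrary choices made for this illustration only; any fixed integer serves the same purpose.

```r
# Minimal sketch: fixing the seed makes random draws repeatable.
# The seed value (762) is an arbitrary choice for illustration.
set.seed(762)
first_run <- rnorm(5)

set.seed(762)          # resetting the seed replays the same random stream
second_run <- rnorm(5)

same_output <- identical(first_run, second_run)
print(same_output)
```

Placing a single set.seed() call in the first code chunk of the R Markdown document is usually enough, provided the chunks are always run in the same order.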
1. The spreadsheet avocado2.csv contains 338 historical records of avocado
sales in various markets in California, US. The attributes are as follows:
Total.Volume - total number of avocados sold
AveragePrice - average price of a single avocado
type - production type (organic or conventional)
A researcher wants to investigate how the volume of sales relates to the
average price and the production type (organic/conventional). Total.Volume
is log-transformed in order to fit a linear regression model with AveragePrice
and type.
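As background for the log-scale model described above, the following is a sketch of the notation involved, not a prescribed solution. The symbols V (Total.Volume), p (AveragePrice), o (an indicator equal to 1 for organic avocados) and the additive form of the linear predictor are assumptions made for this illustration; a suitable model may need other terms, such as an interaction. The key fact is that quantiles, unlike means, commute with strictly increasing transformations such as the logarithm.

```latex
% Quantiles are equivariant under a strictly increasing function g:
Q_\tau\bigl(g(V) \mid p, o\bigr) = g\bigl(Q_\tau(V \mid p, o)\bigr).

% Assumed additive linear model for the tau-quantile on the log scale:
Q_\tau\bigl(\log V \mid p, o\bigr) = \beta_0(\tau) + \beta_1(\tau)\, p + \beta_2(\tau)\, o .

% Taking g = \exp gives the same quantile on the original scale:
Q_\tau\bigl(V \mid p, o\bigr) = \exp\bigl\{\beta_0(\tau) + \beta_1(\tau)\, p + \beta_2(\tau)\, o\bigr\}.

% By contrast, for the mean in general:
\mathbb{E}[\log V] \neq \log \mathbb{E}[V].
```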
(a) Explain why log-transforming the total number of avocados sold is
useful when modelling a quantile with a linear regression. [2 marks]
(b) Find a suitable linear regression model for the 0.2 quantile of log(Total.Volume)
and express the typical 0.2 quantile of the total number of avocados sold
for a given price and production type. [5 marks]
(c) Find a suitable linear regression model for the 0.8 quantile of log(Total.Volume)
and express the typical 0.8 quantile of the total number of avocados sold
for a given price and production type. [5 marks]
(d) Using your model, predict the 0.2 quantile of the total sales for $1.2
conventional avocados and for $1.8 organic avocados. [1 mark]
(e) At what price of conventional avocados would 80% of markets sell at
most 5.4 million avocados? [3 marks]
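The mechanics behind parts (b) to (e) can be sketched as follows. This is an illustration under stated assumptions, not a model answer: it uses the quantreg package (one common choice for quantile regression in R), simulated data in place of avocado2.csv, and a purely additive model; the data frame sim, its coefficients, and all object names are hypothetical.

```r
# Sketch only: simulated data, NOT avocado2.csv. Requires the quantreg package.
library(quantreg)

set.seed(762)
n <- 200
sim <- data.frame(
  AveragePrice = runif(n, 0.8, 2.2),
  type = factor(sample(c("conventional", "organic"), n, replace = TRUE))
)
# Hypothetical data-generating model on the log scale (assumed for illustration)
sim$logVolume <- 17 - 1.5 * sim$AveragePrice -
  2.5 * (sim$type == "organic") + rnorm(n, sd = 0.5)

# Linear quantile regressions for the 0.2 and 0.8 quantiles of log volume
fit20 <- rq(logVolume ~ AveragePrice + type, tau = 0.2, data = sim)
fit80 <- rq(logVolume ~ AveragePrice + type, tau = 0.8, data = sim)

# Predict on the log scale, then back-transform with exp();
# this is valid for quantiles because exp() is strictly increasing.
new <- data.frame(
  AveragePrice = c(1.2, 1.8),
  type = factor(c("conventional", "organic"), levels = levels(sim$type))
)
q20_volume <- exp(predict(fit20, newdata = new))

# Invert the 0.8-quantile line for the reference level ("conventional"):
# solve log(target) = b0 + b1 * price for price.
b <- coef(fit80)
target <- 5.4e6
price_at_target <- unname((log(target) - b["(Intercept)"]) / b["AveragePrice"])

# Check: the fitted 0.8 quantile at that price equals log(target)
check <- predict(fit80, newdata = data.frame(
  AveragePrice = price_at_target,
  type = factor("conventional", levels = levels(sim$type))
))
```

The inversion in the last step relies on "conventional" being the reference level of type, so that its indicator term drops out of the linear predictor.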
2. The spreadsheets (banktrain.csv and banktest.csv) relate to the
direct marketing campaigns of a bank. The marketing campaigns were
based on phone calls. Often, more than one contact with the same client was
required in order to assess whether the product (a bank term deposit) would
be subscribed to or not. The aim is to predict whether a client will subscribe
to a term deposit (variable y).
The attributes are as follows:
gender - gender (categorical: "male", "female")
age - age (numeric)
marital - marital status (categorical: "married", "divorced", "single")
education - education level of the client (categorical: "unknown", "secondary", "primary", "tertiary")
default - credit account status (categorical: "yes", "no")
balance - average yearly balance, in euros (numeric)
housing - housing loan status (categorical: "yes", "no")
loan - personal loan status (categorical: "yes", "no")
contact - contact communication type (categorical: "unknown", "telephone", "cellular")
duration - last contact duration, in seconds (numeric)
campaign - number of contacts performed during this campaign for this client (numeric)
previous - number of contacts performed before this campaign for this client (numeric)
poutcome - outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
y - has the client subscribed to a term deposit? (categorical: "yes", "no")
We use the training data (banktrain.csv) to find a model and the test data
(banktest.csv) to examine the predictive performance of the model. Note that
the number of cross-validation folds is 10.
The function in make.r reformats a data set so that each categorical variable
is replaced by indicator variables corresponding to its levels. It returns
a list with two objects: the reformatted data (data) and a vector of group
memberships (gpname).
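To illustrate what indicator variables and group memberships look like, here is a minimal base-R sketch. It is not the code in make.r, whose contents are not shown here: it uses model.matrix(), which drops one reference level per factor and may therefore code the data differently from make.r. The toy data frame and all object names are hypothetical.

```r
# Sketch only: toy data, NOT the bank data and NOT the function in make.r.
toy <- data.frame(
  age     = c(30, 45, 52, 28),
  marital = factor(c("married", "single", "divorced", "single")),
  housing = factor(c("yes", "no", "yes", "no"))
)

# Expand factors into 0/1 indicator columns (treatment coding,
# so the first level of each factor is the reference and gets no column)
form <- ~ age + marital + housing
X <- model.matrix(form, data = toy)

# The "assign" attribute maps each column to its originating term:
# 0 = intercept, 1 = first term, 2 = second term, ...
assign_idx <- attr(X, "assign")
term_names <- c("(Intercept)", attr(terms(form), "term.labels"))
group <- term_names[assign_idx + 1]

print(colnames(X))
print(group)
```

A group vector of this kind records which indicator columns belong to the same original variable, which is what a penalty acting on whole variables (rather than single columns) needs to know.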
(a) Using the training data, answer the following questions.
i. Using an appropriate penalty on the model complexity, find the
model that minimizes the cross-validation error. Show how you
found the model and describe it in terms of the client
characteristics included. [4 marks]
ii. Using an appropriate penalty on the model complexity, find
a parsimonious model. Show how you found the model and
describe it in terms of the client characteristics included. [4 marks]
(b) Estimate the predictive performance of each model using an appropriate
measure, and compare the two models. [3 marks]
(c) Using your parsimonious model, describe the type of client who is
very likely to subscribe to a term deposit. [3 marks]
(d) If a marketing campaign were to focus on a single client characteristic,
which feature would be most likely to make the campaign succeed? [3 marks]
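The general workflow in question 2 (penalized fitting with 10-fold cross-validation, then evaluation on held-out data) can be sketched as follows. This is an illustration under assumptions, not the intended solution: it uses a lasso-penalized logistic regression from the glmnet package, whereas the assignment may intend a different penalty (for example, one that acts on the groups in gpname). The data are simulated, and the misclassification rate is only one of several possible performance measures.

```r
# Sketch only: simulated data, NOT banktrain.csv / banktest.csv.
# Requires the glmnet package; the lasso penalty is an assumed choice.
library(glmnet)

set.seed(762)
n <- 400
p <- 8
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))  # hypothetical truth

train <- 1:300
test  <- 301:400

# 10-fold cross-validation over the penalty parameter lambda
cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial", nfolds = 10)

# lambda.min minimizes the CV error; lambda.1se is the largest lambda
# whose CV error is within one standard error of that minimum
coef_min <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]
coef_1se <- as.matrix(coef(cvfit, s = "lambda.1se"))[, 1]

# Test-set misclassification rate for a given choice of lambda
test_error <- function(s) {
  prob <- predict(cvfit, newx = x[test, ], s = s, type = "response")
  mean(as.integer(prob > 0.5) != y[test])
}
err_min <- test_error("lambda.min")
err_1se <- test_error("lambda.1se")
```

The one-standard-error rule is a common way to obtain a more parsimonious model, since a larger penalty typically sets more coefficients to zero at a small cost in cross-validation error.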