
Assignment 1

STATS 3860B/9155B - Instructor: Dr. Camila de Souza

Winter 2024

• This assignment is due Friday, January 26, 2024, at 11:55 pm.

• You must write your answers and R code using Rmarkdown (template provided) and generate a single PDF file. Submissions not generated by Rmarkdown will not be graded and receive zero marks.

• Submissions must be done via Gradescope. You must carefully assign questions to their corresponding pages. Submissions without questions assigned to pages will not be graded. Questions with no pages assigned to them will receive zero marks.

• Always show all your work and add comments to your code explaining what you are doing.

• Each student must submit their own work and use their own words to answer questions and explain results. The use of AI tools (such as ChatGPT) to generate answers is prohibited. Scholastic offences are taken seriously, and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at the following Web site: http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf

Question 1

Consider the sex ratio at birth example explored in Sections 2.2 to 2.4 of the Beyond Multiple Linear Regression book by Roback and Legler.

a) Write out the likelihood for Model 0, that is, the model that assumes that the probability of having a boy is independent of the sex of the other children, with the probability of having a girl equal to the probability of having a boy ($p_B = p_G = 0.5$). Consider the hypothetical data in Table 2.1 of Roback and Legler (Case 1: 30 boys and 20 girls) and carry out a likelihood ratio test (LRT) comparing Model 1 versus Model 0 to determine whether there is statistical evidence that the two probabilities are not equal.

b) Repeat the LRT from part a) but now consider the hypothetical data with 600 boys and 400 girls (Case 2). How does the result from this test compare with the result from part a)? Comment on your findings.

c) Consider Case 3, a hypothetical data set on family composition with 6000 boys and 4000 girls.

• Plot $\text{Lik}(p_B)$ and $\log(\text{Lik}(p_B))$ under the independence model (Model 1) for Case 3, highlighting the approximate MLE obtained numerically by a grid search.

• Describe how the likelihood and log-likelihood plots for Case 3 compare to the plots for Cases 1 and 2 shown in Figure 2.4 of Roback and Legler.

• Compute the exact MLE under Model 1 for Case 3.

• Why is it incorrect to perform an LRT to compare Cases 1, 2 and 3?
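As a rough illustration of the grid search and LRT mechanics mentioned above, the sketch below assumes the independence-model likelihood has the form $p_B^{n_B}(1 - p_B)^{n_G}$ and works on the log scale to avoid numerical underflow with large counts. The function names (`loglik_indep`, `grid_mle`, `lrt_equal_prob`) and the grid step are hypothetical choices made for this example, and it is run only on the Case 1 counts.

```r
# Log-likelihood under the assumed independence model:
# n_boys * log(p_b) + n_girls * log(1 - p_b)
loglik_indep <- function(p_b, n_boys, n_girls) {
  n_boys * log(p_b) + n_girls * log(1 - p_b)
}

# Approximate MLE by evaluating the log-likelihood on a grid in (0, 1)
grid_mle <- function(n_boys, n_girls, step = 0.001) {
  grid <- seq(step, 1 - step, by = step)
  ll <- loglik_indep(grid, n_boys, n_girls)
  grid[which.max(ll)]
}

# LRT of p_B = 0.5 against an unrestricted p_B.
# The statistic is referred to a chi-square distribution with 1 df.
lrt_equal_prob <- function(n_boys, n_girls) {
  p_hat <- n_boys / (n_boys + n_girls)
  stat <- 2 * (loglik_indep(p_hat, n_boys, n_girls) -
                 loglik_indep(0.5, n_boys, n_girls))
  c(statistic = stat,
    p_value = pchisq(stat, df = 1, lower.tail = FALSE))
}

# Case 1 counts (30 boys, 20 girls)
p_grid_case1 <- grid_mle(30, 20)
lrt_case1 <- lrt_equal_prob(30, 20)
print(p_grid_case1)
print(lrt_case1)
```

The same functions can be called with other counts; a finer `step` gives a closer grid approximation at the cost of more evaluations.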

Question 2

Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, for $i = 1, \ldots, n$, where the $\epsilon_i$'s are iid $N(0, \sigma^2)$.

a) Write down the likelihood and the log-likelihood for the simple linear regression model.

b) Find the exact maximum likelihood estimates of $\beta_0$ and $\beta_1$ using calculus. Show all your work. How do the MLEs compare to the least squares estimates of $\beta_0$ and $\beta_1$?
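For orientation, a standard way to write the starting point for this question is sketched here, assuming the $x_i$ are treated as fixed and the normal density is parameterized by the variance $\sigma^2$. It is only the setup, not the derivation asked for in part b).

```latex
% Likelihood: product of n independent normal densities
L(\beta_0, \beta_1, \sigma^2)
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left\{ -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right\}

% Log-likelihood: take logs term by term
\log L(\beta_0, \beta_1, \sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
```

Only the second term depends on $\beta_0$ and $\beta_1$, which is what links the maximization to the residual sum of squares.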

Question 3

The data set urine contains 77 observations (after excluding two with missing data) on seven variables. They correspond to urine specimens that were analyzed in an effort to determine if certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals. Please run the code below and use the command ?urine to check the description of each variable.

suppressMessages(library(boot))

which(is.na(urine), arr.ind = TRUE)

##    row col
## 55  55   4
## 1    1   5

urine <- urine[-c(1, 55), ]

urine$r <- factor(urine$r, levels = c("0", "1"), labels = c("no", "yes"))

str(urine)

## 'data.frame': 77 obs. of 7 variables:
## $ r      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ gravity: num 1.02 1.01 1.01 1 1.02 ...
## $ ph     : num 5.74 7.2 5.51 6.52 5.27 5.62 5.67 5.41 6.13 6.19 ...
## $ osmo   : num 577 321 408 187 668 ...
## $ cond   : num 20 14.9 12.6 7.5 25.3 17.4 35.9 21.9 25.7 11.5 ...
## $ urea   : num 296 101 224 91 252 195 550 170 382 152 ...
## $ calc   : num 4.49 2.36 2.15 1.16 3.34 1.4 8.48 1.16 2.21 1.93 ...

a) Explore the data graphically in order to investigate the association between the response variable r (presence of crystals) and each covariate. Which covariates seem most likely to be useful in predicting r? Describe your findings.

b) Fit a logistic regression with r as the response variable and the other six variables as predictors. Report the residual deviance and associated degrees of freedom. Can this information be used to assess if this model fits the data? Explain.

c) Conduct a test of hypothesis to assess if the model fitted in b) is better than a model with just the intercept. What are the null and alternative hypotheses? What is the test statistic and its asymptotic distribution? What is the conclusion? Explain.

d) Use the AIC criterion to determine the best subset of variables.

e) Produce an ROC curve based on the selected model in d). What is the area under the curve? What is the best probability threshold and its corresponding sensitivity and specificity?

f) Based on the best threshold found in e), compute the confusion matrix and use its entries to compute the false positive rate, true positive rate, positive predictive value, and negative predictive value.

g) Based on the results in e) and f), comment on the effectiveness of the fitted logistic regression model in predicting the presence of oxalate crystals in the urine.

h) It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, we randomly assign 51 out of the 77 observations to a training set and the remaining 26 to a test set (see code below, keep the same seed). Now with the data split into training and test sets, do the following:

1) Use the training set to determine the best model. Which variables remain in the model? Are they the same as in d)?

2) Use the test set to assess the performance of the model in 1). Compare the outcome to the results obtained in e) and f).

set.seed(10)

train_ind <- sample(1:77, 51)

train_urine <- urine[train_ind, ]

test_urine <- urine[-train_ind, ]
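To illustrate the mechanics of fitting on the training set and scoring the test set, here is a minimal sketch. It assumes a full six-predictor logistic model and a fixed 0.5 threshold, neither of which is the model or threshold the question asks you to select. `classification_rates` and `urine_clean` are hypothetical names introduced for this example.

```r
suppressMessages(library(boot))  # provides the urine data

# Same preprocessing as above: drop the two rows with missing values
urine_clean <- urine[-c(1, 55), ]
urine_clean$r <- factor(urine_clean$r, levels = c("0", "1"),
                        labels = c("no", "yes"))

# Same split as in the assignment code
set.seed(10)
train_ind <- sample(1:77, 51)
train_urine <- urine_clean[train_ind, ]
test_urine <- urine_clean[-train_ind, ]

# Assumption: full model with all six predictors, fitted on training data only
fit_train <- glm(r ~ ., family = binomial, data = train_urine)

# Predicted probabilities of "yes" for the held-out observations
p_test <- predict(fit_train, newdata = test_urine, type = "response")

# Hypothetical helper: confusion matrix and derived rates at a given threshold
classification_rates <- function(truth, prob, threshold = 0.5) {
  pred <- factor(ifelse(prob >= threshold, "yes", "no"),
                 levels = c("no", "yes"))
  tab <- table(truth = truth, predicted = pred)
  tn <- tab["no", "no"]
  fp <- tab["no", "yes"]
  fn <- tab["yes", "no"]
  tp <- tab["yes", "yes"]
  list(table = tab,
       fpr = fp / (fp + tn),  # false positive rate
       tpr = tp / (tp + fn),  # true positive rate (sensitivity)
       ppv = tp / (tp + fp),  # positive predictive value
       npv = tn / (tn + fn))  # negative predictive value
}

# Assumption: 0.5 is a placeholder threshold, not the "best" one from the ROC
rates_test <- classification_rates(test_urine$r, p_test, threshold = 0.5)
print(rates_test$table)
```

Passing the training data and training-set predictions to the same helper gives the in-sample counterpart for comparison.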

Question 4

A biologist analyzed an experiment to determine the effect of moisture content on seed germination. Eight boxes of 100 seeds each were treated with the same moisture level. Four boxes were covered and four left uncovered. The process was repeated at six different moisture levels (nonlinear scale). The data were ordered in blocks of six observations per box.

library(faraway)

## Attaching package: 'faraway'
## The following objects are masked from 'package:boot':
## logit, melanoma

data(seeds)

## creating a new predictor describing the box:

seeds$box <- factor(x = rep(1:8, c(6, 6, 6, 6, 6, 6, 6, 6)),
                    levels = c("1", "2", "3", "4", "5", "6", "7", "8"))

## removing one observation with missing data

(seeds[is.na(seeds$germ), ])

##    germ moisture covered box
## 47   NA        9     yes   8

seeds <- seeds[!is.na(seeds$germ), ]

str(seeds)

## 'data.frame': 47 obs. of 4 variables:
## $ germ    : num 22 41 66 82 79 0 25 46 72 73 ...
## $ moisture: num 1 3 5 7 9 11 1 3 5 7 ...
## $ covered : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ box     : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...

a) The response variable germ contains the number of seeds that germinated out of 100. Fit a binomial regression model including box and moisture as predictors.

b) Interpret the estimated coefficients of moisture and box4.

c) What are the two hypothesis tests we can use to assess the goodness of fit for the model in a)? Perform one of those tests. Is there statistical evidence for lack of fit?

d) What are the other common causes for a deviance value to be larger than expected besides over/under-dispersion?

e) Suppose we have eliminated the causes listed in d) as the source of the problem, so that we can now put the blame on over/under-dispersion. Estimate the dispersion parameter and comment if the problem is over or underdispersion.

f) Test for the significance of the individual predictors (moisture and box) accounting for overdispersion.

g) Test for the significance of the individual predictors (moisture and box) ignoring overdispersion. How do the results differ from f)?
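One common way to estimate a dispersion parameter and rescale the tests is sketched below, assuming the Pearson chi-square statistic divided by the residual degrees of freedom is the intended estimator and that a quasibinomial refit is an acceptable way to account for it. The object names (`fit_binom`, `fit_quasi`, `phi_hat`) are hypothetical, and the sketch assumes the faraway package is installed.

```r
library(faraway)  # assumed installed; provides the seeds data

data(seeds)
# Box labels 1..8, six consecutive observations per box (as in the code above)
seeds$box <- factor(rep(1:8, each = 6))
seeds <- seeds[!is.na(seeds$germ), ]

# Binomial GLM: germinated versus not germinated out of 100 seeds
fit_binom <- glm(cbind(germ, 100 - germ) ~ box + moisture,
                 family = binomial, data = seeds)

# Assumed estimator: Pearson chi-square divided by residual degrees of freedom
phi_hat <- sum(residuals(fit_binom, type = "pearson")^2) /
  df.residual(fit_binom)
print(phi_hat)

# Refit with a quasibinomial family so that tests use the estimated dispersion
fit_quasi <- update(fit_binom, family = quasibinomial)

# F tests for dropping each predictor, accounting for the dispersion
print(drop1(fit_quasi, test = "F"))

# Chi-square tests from the plain binomial fit, ignoring the dispersion
print(drop1(fit_binom, test = "Chisq"))
```

A value of `phi_hat` well above 1 points to overdispersion and a value well below 1 to underdispersion, under the assumption that the mean structure is adequate.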

Question 5

An earlier study examined the effect of workplace rules in Minnesota which require smokers to smoke cigarettes outside. The number of cigarettes smoked by smokers in a 2-hour period was recorded, along with whether the smoker was at home or at work. A (very) small subset of the data appears in Table 1 and it is also available in the smoking.csv file.

Table 1: A small subset of hypothetical data on Minnesota workplace rules on smoking.

subject   x (location)   y (cigarettes)
1         0              3
2         1              0
3         1              0
4         1              1
5         0              2
6         0              1

• Model 1: Assume that $Y \sim \text{Poisson}(\lambda)$; there is no difference between home and work.

• Model 2: Assume that $Y \sim \text{Poisson}(\lambda_W)$ when the smoker is at work, and $Y \sim \text{Poisson}(\lambda_H)$ when the smoker is at home.

• Model 3: Assume that $Y \sim \text{Poisson}(\lambda)$ and $\log(\lambda) = \beta_0 + \beta_1 x$.

a) Write out the likelihood $L(\lambda)$ and log-likelihood $\log L(\lambda)$ in Model 1. Use the data values in Table 1, and simplify where possible.

b) Intuitively, what would be a reasonable estimate for $\lambda$ based on this data? Why?

c) Use R to produce a plot of the likelihood function $L(\lambda)$. Find the maximum likelihood estimator for $\lambda$ in Model 1 using an optimization routine in R (for example, optim(), but not the glm() function).

d) Write out the log-likelihood function $\log L(\lambda_W, \lambda_H)$ in Model 2. Use the data values in Table 1, and simplify where possible.

e) Intuitively, what would be reasonable estimates for $\lambda_W$ and $\lambda_H$ based on this data? Why?

f) Find the maximum likelihood estimators for $\lambda_W$ and $\lambda_H$ in Model 2 using an optimization routine in R (for example, optim(), but not the glm() function).

g) Write out the log-likelihood function $\log L(\beta_0, \beta_1)$ in Model 3. Again, use the data values in Table 1, and simplify where possible.

h) Find the maximum likelihood estimators for $\beta_0$ and $\beta_1$ in Model 3 using an optimization routine in R (for example, optim(), but not the glm() function). Use R to produce a 3D plot of the log-likelihood function.

i) Confirm your estimates for Model 1 and Model 3 using glm(). Then show that the MLEs for Model 3 agree with the MLEs for Model 2.
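As a rough sketch of how optim() can be used for this kind of problem, the code below minimizes the negative Poisson log-likelihood for Model 1 and Model 3 with the Table 1 values typed in by hand. The function names, starting values, search bounds, and choice of optimization methods are assumptions made for this example, and glm() is used only as a cross-check.

```r
# Table 1 values entered by hand
y <- c(3, 0, 0, 1, 2, 1)  # cigarettes
x <- c(0, 1, 1, 1, 0, 0)  # location indicator

# Model 1: a single rate lambda. optim() minimizes, so negate the log-likelihood.
negloglik_m1 <- function(lambda) {
  -sum(dpois(y, lambda = lambda, log = TRUE))
}

# One-dimensional problems need method = "Brent" with finite bounds
# (the bounds here are an arbitrary assumption)
fit_m1 <- optim(par = 1, fn = negloglik_m1, method = "Brent",
                lower = 1e-6, upper = 10)

# Model 3: log(lambda) = beta0 + beta1 * x
negloglik_m3 <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda = lambda, log = TRUE))
}

# Assumed starting values (0, 0); BFGS uses numerical gradients here
fit_m3 <- optim(par = c(0, 0), fn = negloglik_m3, method = "BFGS")

# Cross-check against glm()
glm_m1 <- glm(y ~ 1, family = poisson)
glm_m3 <- glm(y ~ x, family = poisson)

print(c(optim = fit_m1$par, glm = unname(exp(coef(glm_m1)))))
print(rbind(optim = fit_m3$par, glm = unname(coef(glm_m3))))
```

Model 2 can be handled the same way by letting the rate depend on `x` through two separate parameters constrained to be positive, for example by optimizing over their logarithms.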
