

Question 1 (Total 9 points)

Download the data Q1.Rdata and its codebook Q1 codebook.txt from Canvas.
This is a dataset about a personality test. I have modified the raw dataset
for the exam question.1 The dataset contains responses from 2000 test
takers to 50 variables (personality test questions).
(a) Compare the variables with the codebook and try to understand the meaning of the numbers in the dataset. Also, read an introduction to the big-five personality traits theory to get familiar with the background.2
Answer the questions below:
• What is the value of Q1[10,"EXT2"]? What is the practical meaning
of this value?
• What does the abbreviation ‘EST’ in the variable name represent?
(b) Fit the data with an oblique factor analysis model. Please use maximum
likelihood method for estimation and use the method ‘promax’ for oblique
rotation.
• Present the fitted loading matrix and indicate loadings with absolute
values larger than 0.3. (Hint: you can try the function print to
print the loadings and use the argument cut = 0.3 to control the cut-off value for showing results.)
• Present the estimated correlation matrix of the latent factors. Are the
latent factors uncorrelated or correlated?
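The steps in part (b) can be sketched as follows. This is a minimal sketch, not the required solution: it assumes Q1.Rdata loads a 2000 x 50 data frame named Q1, and it extracts five factors to match the big-five theory (the question itself does not fix the number of factors).

```r
# Assumption: Q1.Rdata loads a 2000 x 50 data frame named Q1.
load("Q1.Rdata")

# Maximum-likelihood factor analysis (factanal's default estimation method)
# with an oblique promax rotation; 5 factors is an assumption based on the
# big-five background.
fit <- factanal(Q1, factors = 5, rotation = "promax")

# Loadings, hiding entries with absolute value below 0.3
print(fit$loadings, cutoff = 0.3)

# For an oblique rotation, the factor correlation matrix can be recovered
# from the stored rotation matrix as Phi = (R'R)^{-1} (assuming your R
# version stores fit$rotmat, as recent versions do).
Phi <- solve(t(fit$rotmat) %*% fit$rotmat)
round(Phi, 2)
```

Off-diagonal entries of `Phi` noticeably different from zero indicate correlated latent factors.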
(c) Compare the fitted results with the codebook. Answer the following questions.
• Interpret the meaning of the latent factors. Do the loadings of the
latent factors align with the big-five personality traits?
• Some values in the loading matrix are positive. Some are negative.
Could you explain why?
• Among the big-five personality traits, which two of them are most
correlated?
Question 2 (Total 9 points)
In this question, we analyze the dataset LetterRecognition from the package
mlbench. After loading the dataset, you should obtain a 20000 × 17 dataset.
Our goal is to predict the variable lettr using all the other variables in the
dataset. We will build our classifiers using linear discriminant analysis (LDA)
and quadratic discriminant analysis (QDA), and evaluate their performance
using 5-fold cross validation.
1 The dataset and codebook are obtained from https://openpsychometrics.org.
2 For example, you can refer to this introduction: https://www.123test.com/big-five-personality-theory
(a) Fit an LDA classifier on LetterRecognition. Predict the classes for the
observations from the same dataset. Estimate the error rate.
(b) Repeat part (a) with QDA. Which classifier is better based on the estimated
error rate?
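Parts (a) and (b) can be sketched as below, assuming the standard lda/qda interface from the MASS package. Note that predicting on the same data used for fitting gives the training error, which is what the question asks for here.

```r
library(mlbench)
library(MASS)
data(LetterRecognition)

# (a) LDA: fit on the full data, then predict the same observations
fit_lda  <- lda(lettr ~ ., data = LetterRecognition)
pred_lda <- predict(fit_lda, LetterRecognition)$class
err_lda  <- mean(pred_lda != LetterRecognition$lettr)  # training error rate

# (b) QDA, analogously
fit_qda  <- qda(lettr ~ ., data = LetterRecognition)
pred_qda <- predict(fit_qda, LetterRecognition)$class
err_qda  <- mean(pred_qda != LetterRecognition$lettr)

c(lda = err_lda, qda = err_qda)
```

The classifier with the smaller training error looks better by this criterion, though part (c) explains why training error alone is not a fair comparison.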
(c) A more reasonable approach for comparing different classification methods
is K-fold cross-validation. In this problem, we use 5-fold cross-validation
to compare the accuracy (one minus the error rate) of LDA and QDA.
• Read the section ‘K-fold cross-validation’ of the article ‘Cross-Validation
Essentials in R’. (See the link in the footnote for a reference.3)
Explain, in a few sentences with your own words, the main steps of a
5-fold cross-validation for estimating the accuracy of a classifier.
• The same article also provides some sample code for implementing
a K-fold cross-validation using the package caret in R. Also, you can
refer to the website4 for more sample code on comparing different
models using cross-validation.
In this question, use a 5-fold cross-validation to compare the accuracy
of LDA and QDA. Report estimated accuracy for LDA and QDA using
cross-validation and report the classifier that performs better. (Hint:
you can try to change the argument method="lda" or method="qda"
in the function train to control the classifier.)
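The cross-validation comparison can be sketched with caret as follows; resetting the seed before each call is an assumption made so that both methods see the same folds.

```r
library(caret)
library(mlbench)
data(LetterRecognition)

# 5-fold cross-validation setup
ctrl <- trainControl(method = "cv", number = 5)

set.seed(1)  # same folds for both methods
cv_lda <- train(lettr ~ ., data = LetterRecognition,
                method = "lda", trControl = ctrl)
set.seed(1)
cv_qda <- train(lettr ~ ., data = LetterRecognition,
                method = "qda", trControl = ctrl)

# Cross-validated accuracy estimates
cv_lda$results$Accuracy
cv_qda$results$Accuracy
```

Report the two accuracy values and name the method with the larger one.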
Question 3 (Total 7 points)
Load the data set Satellite in the package mlbench. This data set contains
6435 observations with 37 variables. In this question, we cluster the observations
using the variables x.1, ..., x.36. That is, all the variables except for classes.
(a) In this part, perform k-means clustering with 1-15 clusters on this data
set. Calculate the within-group sum of squares for each number of clusters
and make a "scree-type" plot. Report the number of clusters you would choose
according to the plot.
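A sketch for the scree-type plot, assuming the last column of Satellite is the factor `classes` and using `nstart = 10` restarts (an assumption; the question does not fix it):

```r
library(mlbench)
data(Satellite)
X <- subset(Satellite, select = -classes)  # keep x.1, ..., x.36

set.seed(1)
# Total within-group sum of squares for k = 1, ..., 15
wss <- sapply(1:15, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:15, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-group sum of squares")
```

Look for the "elbow" where the curve flattens out and report that k.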
(b) In this part, perform k-means clustering with 6 clusters. Report the Rand
index between the true classes and the clusters you estimated from k-means
clustering.
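One way to sketch part (b); the choice of fossil::rand.index for the plain Rand index is an assumption, since several packages provide it:

```r
library(mlbench)
library(fossil)  # provides rand.index(); a package choice, not required
data(Satellite)
X <- subset(Satellite, select = -classes)

set.seed(1)
km <- kmeans(X, centers = 6, nstart = 10)

# Rand index between the true classes and the k-means clusters
rand.index(as.numeric(Satellite$classes), km$cluster)
```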
(c) In this part, perform a hierarchical clustering. Cut the tree to get 6 clusters. Report the Rand index between the true classes and the clusters you
estimated from hierarchical clustering. Compare your result with part (b).
Which clustering rule performs better?
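Part (c) can be sketched as below; Euclidean distance and hclust's default complete linkage are assumptions, since the question does not fix either choice:

```r
library(mlbench)
library(fossil)  # for rand.index(); a package choice, not required
data(Satellite)
X <- subset(Satellite, select = -classes)

hc <- hclust(dist(X))      # Euclidean distance, complete linkage (defaults)
cl <- cutree(hc, k = 6)    # cut the tree into 6 clusters

# Rand index against the true classes, comparable to part (b)
rand.index(as.numeric(Satellite$classes), cl)
```

The clustering rule with the larger Rand index agrees better with the true classes.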
3 http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/
4 https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/