

Question 1 (Total 9 points)

Download the data Q1.Rdata and its codebook Q1 codebook.txt from Canvas.
This is a dataset about a personality test. I have modified the raw dataset
for the exam question.1 The dataset contains responses from 2000 test
takers to 50 variables (personality test questions).
(a) Compare the variables with the codebook and try to understand the meaning of the numbers in the dataset. Also, read an introduction to the big-five personality traits theory to get familiar with the background.2
Answer the questions below:
• What is the value of Q1[10,"EXT2"]? What is the practical meaning
of this value?
• What does the abbreviation ‘EST’ in the variable name represent?
(b) Fit the data with an oblique factor analysis model. Please use maximum
likelihood method for estimation and use the method ‘promax’ for oblique
rotation.
• Present the fitted loading matrix and indicate loadings with absolute
values larger than 0.3. (Hint: you can try the function print to
print the loadings and use the argument cut = 0.3 to control the cut-off value for showing results.)
• Present the estimated correlation matrix of the latent factors. Are the
latent factors uncorrelated or correlated?
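The steps in part (b) can be sketched as follows. This is a minimal sketch, not the required solution: it assumes Q1.Rdata loads a 2000 x 50 data frame named Q1, and it extracts five factors to match the big-five theory (the question itself does not fix the number of factors).

```r
# Assumption: Q1.Rdata loads a 2000 x 50 data frame named Q1.
load("Q1.Rdata")

# Maximum-likelihood factor analysis (factanal's default estimation method)
# with an oblique promax rotation; 5 factors is an assumption based on the
# big-five background.
fit <- factanal(Q1, factors = 5, rotation = "promax")

# Loadings, hiding entries with absolute value below 0.3
print(fit$loadings, cutoff = 0.3)

# For an oblique rotation, the factor correlation matrix can be recovered
# from the stored rotation matrix as Phi = (R'R)^{-1} (assuming your R
# version stores fit$rotmat, as recent versions do).
Phi <- solve(t(fit$rotmat) %*% fit$rotmat)
round(Phi, 2)
```

Off-diagonal entries of `Phi` noticeably different from zero indicate correlated latent factors.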
(c) Compare the fitted results with the codebook. Answer the following questions.
• Interpret the meaning of the latent factors. Do the loadings of the
latent factors align with the big-five personality traits?
• Some values in the loading matrix are positive. Some are negative.
Could you explain why?
• Among the big-five personality traits, which two of them are most
correlated?
Question 2 (Total 9 points)
In this question, we analyze the dataset LetterRecognition from the package
mlbench. After loading the dataset, you should obtain a 20000 × 17 dataset.
Our goal is to predict the variable lettr using all the other variables in the
dataset. We will build our classifiers using linear discriminant analysis (LDA)
and quadratic discriminant analysis (QDA), and evaluate their performance
using 5-fold cross validation.
1 The dataset and codebook are obtained from https://openpsychometrics.org.
2 For example, you can refer to this introduction: https://www.123test.com/big-five-personality-theory
(a) Fit an LDA classifier on LetterRecognition. Predict the classes for the
observations from the same dataset. Estimate the error rate.
(b) Repeat part (a) with QDA. Which classifier is better based on the estimated
error rate?
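Parts (a) and (b) can be sketched as below, assuming the standard lda/qda interface from the MASS package. Note that predicting on the same data used for fitting gives the training error, which is what the question asks for here.

```r
library(mlbench)
library(MASS)
data(LetterRecognition)

# (a) LDA: fit on the full data, then predict the same observations
fit_lda  <- lda(lettr ~ ., data = LetterRecognition)
pred_lda <- predict(fit_lda, LetterRecognition)$class
err_lda  <- mean(pred_lda != LetterRecognition$lettr)  # training error rate

# (b) QDA, analogously
fit_qda  <- qda(lettr ~ ., data = LetterRecognition)
pred_qda <- predict(fit_qda, LetterRecognition)$class
err_qda  <- mean(pred_qda != LetterRecognition$lettr)

c(lda = err_lda, qda = err_qda)
```

The classifier with the smaller training error looks better by this criterion, though part (c) explains why training error alone is not a fair comparison.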
(c) A more reasonable approach for comparing different classification methods
is K-fold cross-validation. In this problem, we use 5-fold cross-validation
to compare the accuracy (one minus the error rate) of LDA and QDA.
• Read the section ‘K-fold cross-validation’ of the article ‘Cross-Validation
Essentials in R’. (See the link in the footnote for a reference.3)
Explain, in a few sentences with your own words, the main steps of a
5-fold cross-validation for estimating the accuracy of a classifier.
• The same article also provides some sample code for implementing
a K-fold cross-validation using the package caret in R. Also, you can
refer to the website4 for more sample code on comparing different
models using cross-validation.
In this question, use a 5-fold cross-validation to compare the accuracy
of LDA and QDA. Report estimated accuracy for LDA and QDA using
cross-validation and report the classifier that performs better. (Hint:
you can try to change the argument method="lda" or method="qda"
in the function train to control the classifier.)
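The cross-validation comparison can be sketched with caret as follows; resetting the seed before each call is an assumption made so that both methods see the same folds.

```r
library(caret)
library(mlbench)
data(LetterRecognition)

# 5-fold cross-validation setup
ctrl <- trainControl(method = "cv", number = 5)

set.seed(1)  # same folds for both methods
cv_lda <- train(lettr ~ ., data = LetterRecognition,
                method = "lda", trControl = ctrl)
set.seed(1)
cv_qda <- train(lettr ~ ., data = LetterRecognition,
                method = "qda", trControl = ctrl)

# Cross-validated accuracy estimates
cv_lda$results$Accuracy
cv_qda$results$Accuracy
```

Report the two accuracy values and name the method with the larger one.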
Question 3 (Total 7 points)
Load the data set Satellite in the package mlbench. This data set contains
6435 observations with 37 variables. In this question, we cluster the observations
using the variables x.1, ..., x.36. That is, all the variables except for classes.
(a) In this part, perform k-means clustering with 1-15 clusters on this data
set. Calculate the within-group sum of squares for each number of clusters
and make a "scree-type" plot. Report the number of clusters you would choose
according to the plot.
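A sketch for the scree-type plot, assuming the last column of Satellite is the factor `classes` and using `nstart = 10` restarts (an assumption; the question does not fix it):

```r
library(mlbench)
data(Satellite)
X <- subset(Satellite, select = -classes)  # keep x.1, ..., x.36

set.seed(1)
# Total within-group sum of squares for k = 1, ..., 15
wss <- sapply(1:15, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:15, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-group sum of squares")
```

Look for the "elbow" where the curve flattens out and report that k.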
(b) In this part, perform k-means clustering with 6 clusters. Report the Rand
index between the true classes and the clusters you estimated from k-means
clustering.
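One way to sketch part (b); the choice of fossil::rand.index for the plain Rand index is an assumption, since several packages provide it:

```r
library(mlbench)
library(fossil)  # provides rand.index(); a package choice, not required
data(Satellite)
X <- subset(Satellite, select = -classes)

set.seed(1)
km <- kmeans(X, centers = 6, nstart = 10)

# Rand index between the true classes and the k-means clusters
rand.index(as.numeric(Satellite$classes), km$cluster)
```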
(c) In this part, perform a hierarchical clustering. Cut the tree to get 6 clusters. Report the Rand index between the true classes and the clusters you
estimated from hierarchical clustering. Compare your result with part (b).
Which clustering rule performs better?
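Part (c) can be sketched as below; Euclidean distance and hclust's default complete linkage are assumptions, since the question does not fix either choice:

```r
library(mlbench)
library(fossil)  # for rand.index(); a package choice, not required
data(Satellite)
X <- subset(Satellite, select = -classes)

hc <- hclust(dist(X))      # Euclidean distance, complete linkage (defaults)
cl <- cutree(hc, k = 6)    # cut the tree into 6 clusters

# Rand index against the true classes, comparable to part (b)
rand.index(as.numeric(Satellite$classes), cl)
```

The clustering rule with the larger Rand index agrees better with the true classes.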
3 http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/
4 https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/