
Data Analysis and Statistical Inference with R - Spring 2018
Homework 4
DUE IN: Friday, 09.03.2018 at 23.59
HOW: electronically in pdf-format via submission to www.turnitin.com
Class id: depends on lab group (see announcement on piazza.com)
enrollment password: 20TiTaNic18
Please register for the class on turnitin ahead of time.
GROUP WORK: is allowed with a maximum of 2 persons per group. PLEASE stay within the
same group throughout the semester. Only one solution is accepted and graded per group.
Please include the names of all group members on each assignment.
HOW MANY: There will be a total of six homework assignments in this semester. We will do
a random selection of questions to be graded. Each week a total of ten points can be gained.
Only the five best homework assignments will be counted.
DUE DATES: 16.02., 23.02., 02.03., 09.03., 16.03., 23.03. (tentatively, subject to change)
FORMAT: Please do the required analyses and provide answers in complete sentences. Provide
the R syntax for the commands. Extract and report those statistics that are
relevant; do not copy complete R output without providing proper answers to the assignment
questions. Integrate requested figures or tables into your document and give a brief verbal
comment/caption on them.
Credit card approvals
You work at American Express supervising the credit card approval division. In order to improve
your team’s efficiency, you aim to automate part of the approval process.
The data set creditcard (an R data set, stored in the file creditcard.Rdata on campusnet)
contains fifteen variables. You are only interested in the following six variables:
Gender Is applicant female or male? (female = 0, male = 1)
Children Do children live in applicant’s household? (no = 0, yes = 1)
MaritalStatus Is applicant married? (not married = 0, married = 1)
HomeOwner Does applicant own a home? (no = 0, yes = 1)
SavingsType Different types of savings accounts (regular = 1, money market = 2, or certificates of
deposit (CDs) = 3)
CreditCard Dichotomous indicator of whether or not the credit card application was approved
(no = 0, yes = 1)
1. From your previous analysis you know that HomeOwnership is an important predictor for
getting a credit card approved. You hence limit your analysis in this homework to homeowning
applicants only. Create a subset of the data just including the homeowners.
(a) (half a point) For this subset, compute the median income.
(b) (half a point) For this subset, compute the standard deviation of income.
(c) (half a point) How many applicants are in this subset?
(d) (half a point) How many of the applicants in the subset are married?
(e) (half a point) How many applicants in the subset own a home?
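The subsetting step in Question 1 can be sketched as follows. This is only an illustration on a synthetic data frame: the real data would come from `load("creditcard.Rdata")`, and the column name `Income` is an assumption, since the sheet does not list the name of the income variable.

```r
# Synthetic stand-in for the creditcard data (all values are made up).
# The column name Income is an assumption; HomeOwner and MaritalStatus
# follow the variable list in the assignment.
set.seed(1)
toy <- data.frame(
  HomeOwner     = rbinom(500, 1, 0.6),
  MaritalStatus = rbinom(500, 1, 0.5),
  Income        = round(rlnorm(500, meanlog = 8.4, sdlog = 0.4))
)

# Keep only the rows where HomeOwner equals 1
owners <- subset(toy, HomeOwner == 1)

median(owners$Income)           # median income in the subset
sd(owners$Income)               # standard deviation of income
nrow(owners)                    # number of applicants in the subset
sum(owners$MaritalStatus == 1)  # number of married applicants
```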
2. On a typical working day, your team is able to process 120 credit card applications. To
simulate this situation you draw a random sample of size 120 from the data subset generated
in Question 1. (In order to make the results reproducible, use set.seed(201803) prior to
drawing the sample.) Based on this sample of size 120, you want to test the null hypothesis
that the mean income in the population is equal to 4750 USD. [hint: use the command t.test
to perform a one-sample t-test to answer this question.]
(a) (1 point) Based on the result obtained, do you conclude to reject the null hypothesis of
the true population mean being equal to 4750 USD?
(b) (half a point) How large is the test-statistic?
(c) (half a point) How large is the corresponding p-value?
(d) (half a point) Does the 95%-confidence interval contain the score 4.75?
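A minimal sketch of the workflow in Question 2, run on made-up incomes rather than the real homeowner subset. The vector `income` and its distribution are assumptions for illustration; only `set.seed(201803)`, the sample size of 120, and the hypothesised mean of 4750 come from the assignment.

```r
# Made-up incomes standing in for the homeowner subset
set.seed(1)
income <- rlnorm(400, meanlog = 8.4, sdlog = 0.4)

# Fix the seed as requested, then draw 120 observations without replacement
set.seed(201803)
smp <- sample(income, size = 120)

# One-sample t-test of H0: mu = 4750 against a two-sided alternative
res <- t.test(smp, mu = 4750)

res$statistic  # test statistic t
res$parameter  # degrees of freedom (n - 1)
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the mean
```

Extracting the components from the returned object, rather than pasting the full printout, fits the FORMAT requirement to report only the relevant statistics.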
3. To check whether R actually computes the right thing, you decide to double check.
(a) (1 point) You first compute the mean and standard deviation of income in your sample
and report these numbers.
(b) (half a point) Next you compute the standard error of the mean by dividing the standard
deviation of your sample by the square root of the sample size.
(c) (1 point) Finally, you compute the test statistic t, which is the ratio of the difference
between the sample mean and the hypothesized value to the standard error of the mean.
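The three steps of Question 3 can be written out by hand as in the sketch below. The sample `smp` is synthetic and the object names are illustrative; the last line compares the hand-computed statistic with the one returned by `t.test`.

```r
set.seed(1)
smp <- rlnorm(120, meanlog = 8.4, sdlog = 0.4)  # made-up sample of size 120
mu0 <- 4750                                     # hypothesised mean

m  <- mean(smp)               # sample mean
s  <- sd(smp)                 # sample standard deviation
se <- s / sqrt(length(smp))   # standard error of the mean
t_manual <- (m - mu0) / se    # test statistic

# Should agree with the statistic reported by t.test
all.equal(t_manual, unname(t.test(smp, mu = mu0)$statistic))
```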
4. Now, you compare the empirical results with the corresponding theoretical distribution.
(a) (1 point) Compute the 2.5% quantile and the 97.5% quantile of the t-distribution with
119 degrees of freedom. Does the test statistic fall inside this range?
(b) (1.5 points) Compute the probability that a random variable that follows a t-distribution
with 119 degrees of freedom takes on values that are in absolute value larger than the
observed test statistic, i.e. $P(|T| > |2.1488|)$.
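For Question 4, the quantile and tail-probability functions of the t-distribution are `qt` and `pt`. The sketch below uses the value 2.1488 quoted in part (b); the object names are illustrative.

```r
df    <- 119
t_obs <- 2.1488  # value quoted in Question 4(b)

# 2.5% and 97.5% quantiles of the t-distribution with 119 degrees of freedom
crit <- qt(c(0.025, 0.975), df = df)

# Does the statistic fall inside the central 95% range?
inside <- t_obs > crit[1] && t_obs < crit[2]

# Two-sided tail probability P(|T| > |t_obs|)
p_two_sided <- 2 * pt(abs(t_obs), df = df, lower.tail = FALSE)
```

Because the t-distribution is symmetric around zero, doubling the upper-tail probability gives the same result as adding the two tails separately.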
5. Now, you simulate a full year's work of your team by drawing a total of 220 samples of size
120 from the income variable in the credit card data set.
(a) (1 point) Compute the median income for each sample. Report the median of the sample
medians as well as the interquartile range of the sample medians.
(b) (1 point) Draw a boxplot of the sample medians. Based on this plot comment on the
sampling distribution of the median income!
(c) (half a point) Compute the 0.025-quantile and the 0.975-quantile of your sampling
distribution of the median income.
6. Using the data obtained in Question 5 compute the following:
(a) (1 point) Compute the mean income for each sample. Report the mean of the sample
means as well as the standard deviation of the sample means.
(b) (1 point) Draw a boxplot of the sample means. Based on this plot comment on the
sampling distribution of the mean income!
(c) (half a point) Compute the 0.025-quantile and the 0.975-quantile of your sampling
distribution of the mean income.
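Questions 5 and 6 work on the same 220 samples, so one way to organise the simulation is to store all samples in a matrix and summarise its columns. The sketch below uses a made-up `income` vector and assumes sampling without replacement, which the assignment does not specify.

```r
set.seed(1)
income <- rlnorm(690, meanlog = 8.4, sdlog = 0.4)  # made-up income variable

# 220 samples of size 120, stored as the columns of a 120 x 220 matrix
samples <- replicate(220, sample(income, size = 120))

meds  <- apply(samples, 2, median)  # one median per sample
means <- colMeans(samples)          # one mean per sample

median(meds); IQR(meds)             # centre and spread of the sample medians
mean(means);  sd(means)             # centre and spread of the sample means

quantile(meds,  c(0.025, 0.975))    # 0.025- and 0.975-quantiles of the medians
quantile(means, c(0.025, 0.975))    # 0.025- and 0.975-quantiles of the means
```

Calling `boxplot(meds)` or `boxplot(means)` on these vectors then draws the requested plots.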
7. (2.5 points) Coming back to your subset data of homeowners (see Question 1), you want to
investigate if there is a gender bias in income. For that, briefly describe in plain English
the distributions of income, separately for males and females (variable Gender). Use relevant
numerical summaries as well as one graphical representation for each of the two distributions.
8. Again using the subset data for homeowners, you want to see whether the difference in means
is large in comparison to the spread of the data.
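(a) (1 point) Calculate the means ($\bar{x}_{\mathrm{inc.f}}$, $\bar{x}_{\mathrm{inc.m}}$) and the standard deviations ($s_{\mathrm{inc.f}}$, $s_{\mathrm{inc.m}}$)
of income separately for females and males. Now calculate the test statistic of the independent-samples t-test
$$t = \frac{\bar{x}_{\mathrm{inc.f}} - \bar{x}_{\mathrm{inc.m}}}{\sqrt{\dfrac{(n_1 - 1)\,s_{\mathrm{inc.f}}^2 + (n_2 - 1)\,s_{\mathrm{inc.m}}^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},$$
where $n_1$ is the number of females and $n_2$ the number of males in the data set.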
(b) (1 point) Using a t-distribution with $n_1 + n_2 - 2$ degrees of freedom, calculate the
probability of a t-distributed random variable being larger than or equal to the
t-statistic calculated above.
(c) (half a point) Based on the results so far, compute the probability under the null
hypothesis of obtaining a result for the test statistic that is at least as extreme as the
one we have obtained.
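The pooled-variance t statistic of Question 8 can be sketched in R as below. The data frame `toy`, its group sizes, and the column name `Income` are made up for illustration; only the coding of Gender (female = 0, male = 1) comes from the variable list. The final lines cross-check the manual result against `t.test` with `var.equal = TRUE`.

```r
set.seed(1)
# Made-up incomes for two groups; Gender coded 0 = female, 1 = male
toy <- data.frame(
  Gender = rep(c(0, 1), times = c(150, 180)),
  Income = c(rlnorm(150, 8.35, 0.4), rlnorm(180, 8.45, 0.4))
)

inc_f <- toy$Income[toy$Gender == 0]
inc_m <- toy$Income[toy$Gender == 1]
n1 <- length(inc_f)
n2 <- length(inc_m)

# Pooled variance and the equal-variance t statistic
s2_pooled <- ((n1 - 1) * var(inc_f) + (n2 - 1) * var(inc_m)) / (n1 + n2 - 2)
t_stat <- (mean(inc_f) - mean(inc_m)) / sqrt(s2_pooled * (1 / n1 + 1 / n2))

df <- n1 + n2 - 2
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)           # P(T >= t)
p_two   <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # both tails

# Cross-check against the built-in equal-variance test
chk <- t.test(inc_f, inc_m, var.equal = TRUE)
all.equal(t_stat, unname(chk$statistic))
all.equal(p_two, chk$p.value)
```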
9. (2.5 points) Use the function t.test to check with an independent-samples t-test whether
income significantly differs between males and females in your subset of homeowners. Assume
equal variances for the two groups. State the statistical null hypothesis to be tested as well
as the alternative hypothesis.
Take a look at the output and compare it with your results above.
10. (2.5 points) Visualise the previous results. Draw a plot for the pdf of the t-distribution with
the adequate number of degrees of freedom for the test statistic t. Color the areas under the
pdf for all values smaller than $-t$ and larger than $t$.
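One possible way to draw such a plot in base R uses `dt` for the density and `polygon` for the shaded tails. The degrees of freedom and the test statistic below are placeholders, not results from the data, and the helper `shade` is a hypothetical name introduced for this sketch.

```r
df    <- 300   # placeholder degrees of freedom (n1 + n2 - 2 in the assignment)
t_obs <- 1.5   # placeholder test statistic, not a result from the data

x <- seq(-4, 4, length.out = 801)
plot(x, dt(x, df = df), type = "l",
     xlab = "t", ylab = "Density", main = "t-distribution with shaded tails")

# Hypothetical helper: fill the area under the density between two x-values
shade <- function(from, to, col = "steelblue") {
  xs <- seq(from, to, length.out = 200)
  polygon(c(from, xs, to), c(0, dt(xs, df = df), 0), col = col, border = NA)
}

shade(-4, -abs(t_obs))                      # lower tail
shade(abs(t_obs), 4)                        # upper tail
abline(v = c(-1, 1) * abs(t_obs), lty = 2)  # mark -|t| and |t|

# Combined area of both tails, i.e. the two-sided p-value
tail_area <- 2 * pt(-abs(t_obs), df = df)
```

The shaded area corresponds to the two-sided p-value, which links the picture back to the numerical results of Questions 8 and 9.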