APSTAT.GE.2110 APPLIED STATISTICS:

USING LARGE DATABASES IN APPLIED RESEARCH

Problem Set 9 – 40 total points

Instructions: In this assignment, you will replicate some findings in Roland Fryer and Steven Levitt’s well-known 2004 paper in the Review of Economics and Statistics, “Understanding the Black-White Test Score Gap in the First Two Years of School.” Create a .do file in Stata that includes all of the following items, plus responses to the questions. Submit your .do and log files, along with the file containing your version of “Table 2” (see 7b below), before the beginning of the next class.

1) Locate the Stata file called “Week 11 ECLSK Replication.” This dataset is an extract from the first two waves of the Early Childhood Longitudinal Study – Kindergarten Cohort (kindergarten and first grade). A link to the Fryer and Levitt paper is also provided in the Lecture Materials section. Take a few minutes to skim the Fryer and Levitt paper to gain an understanding of the research question, data, and methods.

2) Create a .do file that performs all tasks from this step forward. As always, give the .do file a descriptive header that explains what it contains and does, and provide comments throughout explaining what you are doing in each step. [4 points]

a. Set your working directory and open the Week 11 ECLSK replication file.

b. Rename all variables to their lowercase version, for ease of use.
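A minimal sketch of how step 2 might look. The header text, the folder path, and the toy variables are placeholders, not part of the assignment; the toy data stand in for the real file so the snippet runs on its own.

```stata
* ------------------------------------------------------------------
* Problem Set 9 (illustrative header): replicates parts of Table 2
* in Fryer and Levitt (2004) using an ECLS-K extract.
* ------------------------------------------------------------------
* In the real .do file, the toy data below would be replaced by
* something like the following (path is a placeholder):
*   cd "path/to/your/folder"
*   use "Week 11 ECLSK Replication.dta", clear
clear
input GENDER P1WIC
1 1
2 2
end

* Rename every variable to its lowercase version in one command
rename *, lower

describe
```

After this runs, the variables are named gender and p1wic.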

3) ECLS-K used a complex sampling design that should be accounted for in this analysis. There is an excellent slideshow online created by NCES that explains the design:

https://nces.ed.gov/training/datauser/ECLS-K_04/assets/ECLK_04_slides.pdf.

a. Skim the slides linked above to find out more about the weights, cluster unit, and strata. We will use Taylor Series Linearization as our method for variance calculation. Find the section on standard error calculations using Taylor Series Linearization. We will use the weight variable for the 1st and 2nd panel waves called bycw0. What is the PSU variable that accompanies this weight? What is the strata variable? (Hint: all these variable names begin with bycw.)

b. Use svyset to indicate the design variables (PSU, weight, and strata). [2 points]
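The general shape of an svyset call can be sketched on toy data. The names psu_id, strat_id, and wt are hypothetical stand-ins; the actual PSU and strata variable names are the ones you find in the NCES slides.

```stata
* Toy design: 2 strata, 2 PSUs each (all names are hypothetical)
clear
input psu_id strat_id wt
1 1 10.5
2 1 12.0
3 2 8.0
4 2 9.5
end

* Declare the design: PSU first, then the sampling weight, then strata.
* vce(linearized) requests Taylor series linearization (Stata's default).
svyset psu_id [pweight = wt], strata(strat_id) vce(linearized)

* With no arguments, svyset displays the current settings
svyset

* Summarize strata and PSU counts under the declared design
svydescribe
```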

4) Create new variables as described below. In each case, make sure the new variable is coded as missing whenever the source variable is missing. Missing values are usually coded as negative numbers, except for normalized variables that have legitimate negative values (such as SES). When in doubt, display a table of variable values to determine whether negative values are “real” or not. Note that there may also be missing values coded using the Stata missing code “.”

a. Create a new race variable called race_v2 with five categories: white, black, Hispanic, Asian/Pacific Islander, other. Assign value labels to this new variable. Cross-tabulate the new variable against the old one to ensure the new variable has been coded (and labeled) properly. Use tabulate to get the relative frequencies of your new race variable, and run this with and without the svy prefix. Are there any racial/ethnic groups that appear to have been oversampled? How do you know? [4 points]
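One way to approach 4a is sketched below on toy data. The mapping of original codes 1 to 8 onto the five categories is an assumption made for illustration; check the value labels of the original race variable in the actual file before reusing it. The weight wt is also hypothetical.

```stata
* Toy data: original codes 1-8 plus a negative code and a Stata missing
clear
input race wt
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
-9 1
. 1
end

* Collapse to five labeled categories; everything else becomes missing.
* The code-to-category mapping here is assumed, not taken from the file.
recode race (1 = 1 "White") (2 = 2 "Black") (3 4 = 3 "Hispanic") ///
    (5 6 = 4 "Asian/Pacific Islander") (7 8 = 5 "Other") ///
    (else = .), generate(race_v2)

* Cross-tab old against new to verify the coding and the labels
tabulate race race_v2, missing

* Unweighted versus weighted relative frequencies
tabulate race_v2
svyset [pweight = wt]
svy: tabulate race_v2
```

A group whose unweighted share is noticeably larger than its weighted share was sampled at a higher rate than its share of the population.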

b. Create a new mother’s age at first birth variable (momage) that is identical to the original (p1hmafb). [1 point]

c. Using mother’s age at first birth (p1hmafb), create a dummy variable (teen) that equals 1 if the mother was a teenager at the time of her first birth (age 19 or under), and equals 0 otherwise. [1 point]

d. Using mother’s age at first birth (p1hmafb), create a dummy variable (thirties) that equals 1 if the mother was aged 30 or older at the time of her first birth, and equals 0 otherwise. [1 point]
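Items b through d can be sketched as follows on toy values of p1hmafb. The sketch assumes every negative value is a missing-data code, as the preamble to step 4 suggests. Because Stata treats "." as larger than any number, the if qualifiers are what keep missing ages from being coded 0 or 1.

```stata
* Toy ages at first birth, including a negative code and a Stata missing
clear
input p1hmafb
17
19
20
30
35
-9
.
end

* Copy of the original, with negative codes left as missing
generate momage = p1hmafb if p1hmafb >= 0 & !missing(p1hmafb)

* Dummies are defined only where momage is observed
generate teen     = (momage <= 19) if !missing(momage)
generate thirties = (momage >= 30) if !missing(momage)

list
```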

e. Using gender, create a dummy variable (female) that equals 1 if the child is female, and equals 0 otherwise. [1 point]

f. Using p1wic, create a dummy variable (wic) that equals 1 for mothers who participated in WIC, and equals 0 otherwise. (WIC is the Women, Infants, and Children Supplemental Nutrition Program for low-income families.) [1 point]

g. Using p1chlboo, create a new variable books for the number of children’s books in the home, and a squared version, books2. Divide books2 by 1,000. (Dividing by 1,000 simply rescales the term so that the regression coefficient is more legible.) [2 points]

h. Using the IRT test score variables for the fall of kindergarten (c1r4rscl and c1r4mscl), create z-scores for math and reading that have a mean of 0 and standard deviation of 1. [2 points]

Note: you should ensure the negative test score values are set to missing prior to this step. Otherwise, they will be improperly used in the z-score calculation.
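A minimal sketch of the standardization in 4h, on toy scores. It uses the unweighted sample mean and standard deviation, which is an assumption since the item does not say whether weights should be applied. The z_ prefix for the new variables is a hypothetical naming choice.

```stata
* Toy IRT scores with negative missing-data codes
clear
input c1r4rscl c1r4mscl
20 15
25 18
30 24
-9 21
35 -1
end

foreach v in c1r4rscl c1r4mscl {
    * Set negative codes to missing so they do not distort mean and sd
    replace `v' = . if `v' < 0
    * z = (score - mean) / sd, using the unweighted sample moments
    quietly summarize `v'
    generate z_`v' = (`v' - r(mean)) / r(sd)
}

summarize z_*
```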

i. There are two child birthweight variables in the dataset that can be combined into one. One ends in “p” and is the “pounds” component of birthweight, and the other ends in “o” and is the “ounces” component of birthweight. For example, if the first is 7 and the second is 14, the child weighed 7 lbs. 14 ounces at birth. Use these together to create a new birthweight variable that is the child’s weight in ounces. Then divide this variable by 10. (Again, dividing by 10 here simply rescales to make the regression coefficient more legible.) [2 points]

Note: you should ensure the negative values for pounds and ounces are set to missing prior to this step. Otherwise, they will be improperly used in the calculation.
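The arithmetic in 4i is ounces = 16 × pounds + ounces, then divided by 10. A sketch on toy data follows; bw_lbs, bw_oz, and bweight are hypothetical names standing in for the two source variables and the new one.

```stata
* Toy birthweight components; negative values are missing-data codes
clear
input bw_lbs bw_oz
7 14
6 0
-9 3
8 -1
end

* Set negative codes to missing before combining
replace bw_lbs = . if bw_lbs < 0
replace bw_oz  = . if bw_oz  < 0

* Total ounces (16 oz per lb), rescaled by 10; missing if either part is
generate bweight = (16 * bw_lbs + bw_oz) / 10

list
```

For the example in the text, 7 lbs. 14 ounces is 16 × 7 + 14 = 126 ounces, so the rescaled value is 12.6.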

j. Finally, create a new ses variable that is identical to the original (wksesl). [1 point]

k. Now that you have created these new variables, replace missing values with zero rather than leaving them missing. (Exception: do not do this for the test score variables or race.) The reason for doing this is to avoid losing observations with missing values in the regression analysis. Before replacing missing values with zero, generate a flag (dummy variable) for each variable that equals 1 if the variable is missing, and equals 0 otherwise. Pro tip: a clever naming convention can make it easier to include missing value flags in a regression: e.g., the variables wic and wic_miss can be included in a regression by typing wic*. [4 points]
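The flag-then-fill pattern in 4k can be sketched with a loop. The toy data use only wic and books; in the real file the varlist would hold every new variable except the test scores and race.

```stata
* Toy data with some missing values
clear
input wic books
1 20
. 50
0 .
end

foreach v of varlist wic books {
    * Flag first, while the missing values are still identifiable
    generate `v'_miss = missing(`v')
    * Then replace missing with zero
    replace `v' = 0 if `v'_miss == 1
}

list

* The wildcard picks up both the variable and its flag
describe wic*
```

The order matters: once the zeros are filled in, a missing value can no longer be told apart from a true zero.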

5) Fryer and Levitt exclude from their analysis any student who is missing data on race or age (the latter is r1_kage). Use your new race variable and r1_kage to drop any observation that is missing either. How many cases are dropped? How many are left? [2 points]
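A sketch of step 5 on toy data. It assumes negative values of r1_kage are missing-data codes; the ages shown are invented for illustration.

```stata
* Toy data: two complete cases, three with race or age missing
clear
input race_v2 r1_kage
1 65.5
2 .
. 70.1
3 -9
4 68.0
end

* Assumption: negative ages are missing-data codes
replace r1_kage = . if r1_kage < 0

* How many cases will be dropped?
count if missing(race_v2) | missing(r1_kage)

drop if missing(race_v2) | missing(r1_kage)

* How many are left?
count
```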

6) Because of the dropped cases, one of the strata is left with only one cluster. This is problematic, as the svy commands will not work properly with only one PSU in a stratum. Use svydescribe to figure out which stratum has only one PSU, and replace the number of the “singleton” stratum with one of the adjacent strata numbers. [4 points]
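The singleton fix in step 6 might look like the sketch below. The design variables and stratum numbers are toy values; which stratum is the singleton in the real data, and which neighbor to merge it into, is for you to determine from the svydescribe output.

```stata
* Toy design in which stratum 2 has a single PSU
clear
input strat_id psu_id wt
1 1 1
1 2 1
2 3 1
3 4 1
3 5 1
end

svyset psu_id [pweight = wt], strata(strat_id)

* Lists the number of PSUs per stratum and flags single-unit strata
svydescribe

* Optional cross-check: count distinct PSUs within each stratum
egen tag = tag(strat_id psu_id)
bysort strat_id: egen n_psu = total(tag)
tabulate strat_id if n_psu == 1

* Fold the singleton stratum into an adjacent one (illustrative choice)
replace strat_id = 1 if strat_id == 2

* Confirm that no single-unit strata remain
svydescribe
```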

7) Replicate Table 2: Replicate the regressions shown in columns 1-4 and 6-9 in Table 2. All of these regressions pertain to test scores in the fall of kindergarten. [8 points]

a. Be sure to include svy: before every regression command, and include the missing value flags created in part 4k as additional regressors anytime you include a regressor that corresponds with the flag. For example, if wic is one of your explanatory variables, also include the missing value flag wic_miss.
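The mechanics of 7a can be sketched on simulated data. The variables, the seed, and the two specifications below are hypothetical and are not the covariate sets of Table 2; take those from the paper.

```stata
* Simulated data: 4 strata, 5 PSUs per stratum, 10 children per PSU
clear
set seed 2110
set obs 200
generate strat_id = ceil(_n / 50)
generate psu_id   = ceil(_n / 10)
generate wt       = 1 + runiform()
generate race_v2  = 1 + mod(_n, 5)
generate wic      = runiform() < 0.4
generate wic_miss = runiform() < 0.1
replace  wic      = 0 if wic_miss == 1
generate zmath    = rnormal()

svyset psu_id [pweight = wt], strata(strat_id)

* Race indicators only; i. makes the lowest category the omitted group
svy: regress zmath i.race_v2

* Adding a covariate together with its missing-value flag via wic*
svy: regress zmath i.race_v2 wic*
```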

C1	Fall of kindergarten
C2	Spring of kindergarten
C3	Fall of 1st grade
C4	Spring of 1st grade
C5	Spring of 3rd grade
C6	Spring of 5th grade
C7	8th grade
C8	Spring of 5th grade
