
STAT 385 Fall 2019 - Homework Assignment 03
Due by 12:00 PM 10/13/2019
The Homework Problems
Below you will find problems for you to complete as an individual. It is fine to discuss the homework problems with classmates, but cheating is prohibited and will be harshly penalized if detected.

1. Create a custom volume measurement function that will convert the following units of volume:
13 imperial (liquid) cups to cubic inches.

2.5 US customary (liquid) gallons to fluid ounces.

3 US customary (dry) teaspoons to milliliters.

75 (dry) liters to imperial quarts.
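
One way to organize such a function is to route every conversion through a common base unit. The sketch below assumes milliliters as the base and uses a hypothetical function name, convert_volume, with hypothetical unit labels. The lookup table is deliberately partial: factors for the remaining units (for example the imperial cup or the US teaspoon) would need to be checked against a reference before being added.

```r
# Milliliters per unit. Partial, illustrative table; the unit labels are
# hypothetical names chosen for this sketch.
ml_per_unit <- c(
  milliliter     = 1,
  liter          = 1000,
  cubic_inch     = 16.387064,      # exact, from 1 in = 2.54 cm
  us_fluid_ounce = 29.5735295625,  # US gallon / 128
  us_gallon      = 3785.411784,    # 231 cubic inches
  imperial_quart = 1136.5225       # imperial gallon (4546.09 mL) / 4
)

convert_volume <- function(value, from, to) {
  # Fail early on units that are not in the table.
  stopifnot(from %in% names(ml_per_unit), to %in% names(ml_per_unit))
  # Convert to milliliters first, then to the target unit.
  value * ml_per_unit[[from]] / ml_per_unit[[to]]
}

convert_volume(1, "us_gallon", "us_fluid_ounce")  # 128
```

Going through a base unit means each new unit needs only one factor, rather than one factor per pair of units.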

2. Do the following:
create a 25 × 25 matrix with autoregressive structure with p = 9/10; every element in the matrix should be equal to (9/10)^|i−j|, where i is the row index and j is the column index. Report the row and column sums of this matrix.

run the commands:

set.seed(13)
x <- c(10, 10)
n <- 2
Create a while loop which concatenates a new mean-zero normal random variable with σ = 2 to the existing vector x at every iteration. Have this loop terminate when the standard error (the estimated standard deviation of x divided by √n) is lower than 1/10. Report n.

repeat part b and report n after running the commands:
set.seed(13)
x <- rnorm(0, sd = 2)
n <- 1
The sample size required to get a standard error lower than 1/10 was smaller in part c than it was in part b. We would expect this to be the case before running any code. Why?
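
The two building blocks of this problem can be sketched on a smaller scale. The helper names ar_matrix, se, and grow_until are hypothetical, and the dimension, seed, starting vector, and tolerance are illustrative values rather than the assigned ones. The loop here checks the stopping rule after each append, which avoids calling sd() on fewer than two values; whether the assigned loop should test before or after appending is not stated in the problem.

```r
# Autoregressive-style matrix: entry (i, j) equals rho^|i - j|.
ar_matrix <- function(n, rho) {
  idx <- seq_len(n)
  rho^abs(outer(idx, idx, "-"))
}

m <- ar_matrix(4, 0.5)   # small illustrative case, not the assigned one
rowSums(m)
colSums(m)

# Standard error as defined in the problem: sd(x) / sqrt(n).
se <- function(x) sd(x) / sqrt(length(x))

# Append one N(0, sigma^2) draw per iteration, then check the stopping rule.
grow_until <- function(x, sigma, tol) {
  done <- FALSE
  while (!done) {
    x <- c(x, rnorm(1, mean = 0, sd = sigma))
    done <- length(x) >= 2 && se(x) < tol
  }
  x
}

set.seed(1)                                         # arbitrary seed for the sketch
out <- grow_until(c(1, -1), sigma = 1, tol = 0.25)
length(out)                                         # sample size n at termination
```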
3. Do the following (Efron’s bootstrap):
load in the dataset dataHW3.csv

call the first column of this dataset x. Compute the statistic (mean(x) - 10)/se(x) where se is shorthand for standard error (see the previous problem for the definition of standard error).

now resample the elements of x with replacement 10000 times, and compute and store the statistic (mean(x’) - mean(x))/se(x’) at each iteration, where x’ corresponds to the resample of the elements of x. Call the vector which contains these resampled statistics `resamples`. Use an apply function for this part.

run the command `hist(resamples, breaks = 20)` to make a histogram, and include this histogram in your assignment.

repeat parts b through d with respect to the second column of dataHW3.csv. Would you say that the test statistic calculated from each column has the same distribution?
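
The resampling step can be sketched with sapply on simulated data. Everything specific here is an assumption made for illustration: the exponential sample stands in for a column of dataHW3.csv, the seed is arbitrary, and the number of resamples is smaller than the assigned 10000.

```r
# Standard error: sd(x) / sqrt(n).
se <- function(x) sd(x) / sqrt(length(x))

set.seed(385)            # arbitrary seed
x <- rexp(30)            # simulated stand-in for a data column
B <- 2000                # number of resamples for this sketch

# Each iteration draws a resample of x with replacement and computes the
# studentized statistic centered at the original sample mean.
resamples <- sapply(seq_len(B), function(b) {
  x_star <- sample(x, size = length(x), replace = TRUE)
  (mean(x_star) - mean(x)) / se(x_star)
})

length(resamples)
```

The resulting vector can be passed directly to hist(resamples, breaks = 20).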

4. Do the following:
make sure you have the dataset WPP2010.csv (your file location may need to change) and then run the commands:
# load in UN dataset and remove irrelevant variables
options(warn=-1)
WPP2010 <- read.csv("WPP2010.csv", header = TRUE)
colnames(WPP2010)[3] <- c("region")
colnames(WPP2010)[6] <- c("year")
colnames(WPP2010)[7:17] <- paste("age", 0:10 * 5, sep = "")
WPP2010 <- WPP2010[, c(3, 6, 11, 12)]

# restrict attention to countries of interest
countries <- c("Canada", "Mexico", "United States of America")

# obtain population data for all countries for all years
dataset <- WPP2010[WPP2010[, 1] %in% countries, ]
dataset[, 3] <- as.numeric(levels(dataset[, 3]))[dataset[, 3]]
dataset[, 4] <- as.numeric(levels(dataset[, 4]))[dataset[, 4]]
dataset[, 3:4] <- dataset[, 3:4] / 1000

# get population dataset for this analysis corresponding to the
# Census years
dataset.years <- dataset[dataset[, 2] %in%
  c("1960", "1970", "1980", "1990", "2000", "2010"), ]
dataset.years[, 2] <- factor(dataset.years[, 2])
dataset.years.list <- split(dataset.years, f = as.factor(dataset.years[, 2]))
pops <- unlist(lapply(dataset.years.list, function(x) sum(x[, 3:4])))
The code in part a is partially commented. Add comments to all remaining lines of code to make the script clear.

Determine the proportion of mainland North American males aged 20-29 that lived in 1970 or before.

5. With the tidyverse package and its functions, do the following with the CCSO Bookings Data:
show only the 2012 bookings for people ages 17-23 years old not residing in Illinois and show the data dimension

show only the bookings for people who have employment status as “student” booked after the year 2012 residing in Danville and show the data dimension

show only the bookings for Asian people residing in the cities of Champaign or Urbana and show the data dimension

repeat parts a-c using only pipe operators
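
The difference between nested calls and pipes can be sketched on a toy table. The column names and values below are hypothetical and are not taken from the CCSO Bookings Data, whose actual variable names should be checked after loading.

```r
library(dplyr)

# Toy stand-in for the bookings data; all column names are hypothetical.
bookings <- tibble(
  booking_year = c(2012, 2012, 2013, 2012),
  age          = c(19, 35, 22, 21),
  state        = c("INDIANA", "ILLINOIS", "ILLINOIS", "ILLINOIS")
)

# Nested-call form: filter first, then report the dimension.
nested <- dim(filter(bookings,
                     booking_year == 2012, age >= 17, age <= 23,
                     state != "ILLINOIS"))

# Equivalent pipe form.
piped <- bookings %>%
  filter(booking_year == 2012, between(age, 17, 23), state != "ILLINOIS") %>%
  dim()

nested   # 1 3
piped    # 1 3
```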

Select in-class tasks
Completion of select in-class tasks will be worth 1 point and will be graded largely by completion. Obvious errors and incomplete work will receive deductions. Problems 3-5 are directly copied from your notes. Problems 1-2 are copied from the notes with minor alterations. In these problems I ask that you display the first 5 rows of the dataset instead of the entire dataset.

Load in the CCSO dataset, discover 3 factor (or categorical) variables and 3 numeric variables. Show the first 5 rows of this dataset with only those 6 variables.

Rename one of the factor variables to a name that is easier to understand than the original variable name. Show the first 5 rows of the dataset with all variables such that the variable with the new name is the first column in the dataset.

Write 3 separate loops: a for loop, a while loop, and a repeat loop that give the same result. The result should be the cumulative sum of Days in jail among Black people with Arrest Ages 18-24 and Student as Employment status within the CCSO Bookings Data.
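
The three loop forms can be compared on a toy vector. The values in days are made up for this sketch and stand in for the filtered Days in jail column.

```r
days <- c(3, 0, 12, 5)   # toy values standing in for the filtered column

# for loop: iterate over the indices directly.
cs_for <- numeric(length(days))
total <- 0
for (i in seq_along(days)) {
  total <- total + days[i]
  cs_for[i] <- total
}

# while loop: the condition is checked before each pass.
cs_while <- numeric(length(days))
total <- 0
i <- 1
while (i <= length(days)) {
  total <- total + days[i]
  cs_while[i] <- total
  i <- i + 1
}

# repeat loop: runs until an explicit break.
cs_repeat <- numeric(length(days))
total <- 0
i <- 1
repeat {
  if (i > length(days)) break
  total <- total + days[i]
  cs_repeat[i] <- total
  i <- i + 1
}

identical(cs_for, cs_while) && identical(cs_while, cs_repeat)  # TRUE
```

All three agree with the built-in cumsum(days), which is a convenient check.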

Here are some images of R code. Read the code, debug it if necessary, and judge it on its efficiency and correctness. Decide on which set of code is better and improve the better one.

Using the vector y below
set.seed(385)
y <- rnorm(100)
Use the which.min and which.max functions to display the index corresponding to the minimum and maximum elements of y.

Do the which.min and which.max functions work? (try: max(y) == y[which.max(y)]).

Use the which function and the length function to report the proportion of the elements of y that are greater than 0.

Discuss why the proportion in part c is close to 0.5. Hint: What is the mean of the normal distribution that generated the elements in y?
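
The indexing functions used above can be illustrated on a short vector whose answers are easy to verify by eye. The vector z is a made-up example, not the assigned y.

```r
z <- c(-1.5, 0.3, 2.1, -0.2, 0.8)   # toy vector, not the assigned y

which.min(z)                        # 1, the position of -1.5
which.max(z)                        # 3, the position of 2.1
max(z) == z[which.max(z)]           # TRUE

# which() returns the indices where the condition holds, so its length
# divided by the vector length is the proportion of positive elements.
prop_pos <- length(which(z > 0)) / length(z)
prop_pos                            # 0.6
```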

Create a factor variable with 50 values of A and 50 values of B, and name this factor variable trt.

Create a data frame consisting of x and trt.