APANPS4335: Machine Learning
Directions: Please submit your homework as a single attachment
on the Canvas class website.
I would like to get an idea of your interests in Machine Learning and your background.
Please tell me:
Your last degree program (B.A., M.A., E.D., Ph.D., etc.), your major, and year
Your mathematical proficiency (Calculus): scale of 1 (Poor) to 5 (Excellent)
Your mathematical proficiency (Linear Algebra): scale of 1 (Poor) to 5 (Excellent)
Your statistical proficiency (Probability, Statistical Distributions): scale of 1 (Poor) to 5 (Excellent)
Your statistical inference proficiency (Regression, Logistic Regression): scale of 1 (Poor) to 5 (Excellent)
Your R programming proficiency (Plots, Functions, R Markdown): scale of 1 (Poor) to 5 (Excellent)
We will be using RStudio in this class to implement the algorithms we learn in class. The
goal of this assignment is to get you proficient in the basics of R, such as writing scripts
and plotting. If you are having trouble, you can find help by searching the internet (searching
for the specific error message is often helpful), reading Data Mining with R by Luis Torgo
or R in a Nutshell by Joseph Adler, asking your friends, and coming to office hours. The
computing homework must be submitted with your name and UNI number as an R Markdown
file and a PDF with the code and explanation. If you have difficulty with R Markdown, please
watch https://www.youtube.com/watch?v=7BXX7TaF1Gs for a quick intro.
1. (4 Points) Install R, RStudio, and R Markdown. In the command window, type version and
hit Enter.
2. (6 Points) We are going to make a function. Go to "File" and then select "New Document."
This is where you will write a script that takes a vector as an input and outputs
the sum of the odd elements. For both parts of this question, include a copy of each
function,
Function.Name [...] 75] = 'Yes'
baseball$high_rbi <- factor(baseball$high_rbi)
i. What is the 'class' of your new predictor variable 'high_rbi'?
j. What are the dimensions of the baseball database now?
k. What percentage of players had more than 75 RBIs in the 1986-1987 seasons?
l. Use R's plot() function to get side-by-side boxplots of Salary and high_rbi (Salary
should be your vertical (y) axis, high_rbi your horizontal (x) axis).
Note: A baseball player who can produce a lot of RBIs (runs batted in) is valuable for a
baseball team. So is a player who scores a lot of runs. Now we are going to create another new
qualitative variable for players who have both of these skills, which we will call "elite".
Your new qualitative variable should have two levels: "elite" and "not elite".
m. What percentage of players were high RBI ("Yes") and also scored more than 80 Runs?
(Again, Torgo section 1.2.27 on subsetting might be helpful.)
n. Use R's plot() function to generate side-by-side boxplots of elite players and non-elite
players with respect to Salary.
o. Use R's hist() function to produce four histograms with differing numbers of bins and
colors for four quantitative variables (Hits, RBI, HmRun, and Runs). You may find
the R command par(mfrow=c(2,2)) useful: it will divide the print window into four
regions so that four plots can be made simultaneously. Modifying the arguments to
this function will divide the screen in other ways. You can start like this.
par(mar=c(1,1,1,1))
par(mfrow=c(2,2))
p. Continue exploring the data, and provide a brief summary of what you discover.
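
The steps in parts h through o can be sketched on a small made-up data frame. This is only a minimal sketch under stated assumptions, not a solution to the exercise: the `toy` data frame and all of its values are hypothetical stand-ins for the baseball data, the column name `high_rbi` assumes an underscore where the extracted text shows a space, and defining "elite" as high RBI together with more than 80 Runs is one reading of part m.

```r
# Hypothetical toy data standing in for the baseball data (values are made up).
toy <- data.frame(
  Hits   = c(81, 130, 141, 87, 169, 37, 73, 159),
  HmRun  = c(7, 18, 20, 10, 4, 1, 0, 21),
  Runs   = c(24, 66, 65, 39, 74, 23, 24, 107),
  RBI    = c(38, 72, 78, 42, 51, 8, 24, 96),
  Salary = c(475, 480, 500, 91.5, 750, 70, 100, 1100)
)

# Build a two-level qualitative variable from a numeric threshold.
toy$high_rbi <- "No"
toy$high_rbi[toy$RBI > 75] <- "Yes"
toy$high_rbi <- factor(toy$high_rbi)

class(toy$high_rbi)  # class of the new variable
dim(toy)             # number of rows and columns after adding it

# The mean of a logical vector is the proportion of TRUE values.
pct_high <- 100 * mean(toy$high_rbi == "Yes")

# Assumed definition of "elite": high RBI and more than 80 Runs.
toy$elite <- factor(ifelse(toy$high_rbi == "Yes" & toy$Runs > 80,
                           "elite", "not elite"))
pct_elite <- 100 * mean(toy$elite == "elite")

# With a factor on the x axis, plot() draws side-by-side boxplots.
plot(Salary ~ high_rbi, data = toy, xlab = "high_rbi", ylab = "Salary")
plot(Salary ~ elite, data = toy, xlab = "elite", ylab = "Salary")

# Four histograms in a 2 x 2 grid; `breaks` is a suggested number of bins.
par(mfrow = c(2, 2))
hist(toy$Hits,  breaks = 4,  col = "steelblue", main = "Hits",  xlab = "Hits")
hist(toy$RBI,   breaks = 6,  col = "tomato",    main = "RBI",   xlab = "RBI")
hist(toy$HmRun, breaks = 8,  col = "darkgreen", main = "HmRun", xlab = "HmRun")
hist(toy$Runs,  breaks = 10, col = "gold",      main = "Runs",  xlab = "Runs")
par(mfrow = c(1, 1))  # restore the single-plot layout
```

Note that hist() treats a single number passed to `breaks` as a suggestion, so the number of bins actually drawn may differ slightly from the value requested.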
7. Baseball Part 2 (15 points)
a. What is the range of your first 7 quantitative predictors? Here you can use R's range()
function. (Try using the sapply function, too, which will call up all seven ranges at
once.)
b. What are the mean and standard deviation of each of these quantitative predictors?
c. Now remove the 20th through 60th observations. What is the range, mean, and standard
deviation of each of your seven predictors in the subset of the data that remains?
(This is roughly 80% of the original data.) How would you characterize the differences
between the before and after sets?
d. Using the full data set, investigate the seven predictors graphically, using scatterplots
or other tools of your choice. Create some plots highlighting the relationships among
the predictors. Comment on your findings.
e. Suppose that we wish to predict salary on the basis of the other variables. Do your
plots suggest that any of the other variables might be useful in predicting salary?
Justify your answer.
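
The mechanics of parts a through d can be sketched as follows. This is a minimal illustration under stated assumptions rather than a solution: the `toy` data frame is hypothetical, filled with random numbers, and has three columns instead of the seven predictors the question asks about.

```r
set.seed(1)  # make the made-up data reproducible

# Hypothetical toy data: 100 observations of three quantitative variables.
toy <- data.frame(
  Hits   = round(runif(100, 30, 200)),
  Runs   = round(runif(100, 10, 120)),
  Salary = round(runif(100, 70, 2000))
)

# sapply() applies a function to every column and collects the results.
ranges_full <- sapply(toy, range)  # matrix: row 1 = minimum, row 2 = maximum
means_full  <- sapply(toy, mean)
sds_full    <- sapply(toy, sd)

# Negative indices drop rows: remove observations 20 through 60.
toy_sub <- toy[-(20:60), ]

ranges_sub <- sapply(toy_sub, range)
means_sub  <- sapply(toy_sub, mean)
sds_sub    <- sapply(toy_sub, sd)

# Compare the summaries before and after removing the rows.
rbind(full = means_full, subset = means_sub)
rbind(full = sds_full,   subset = sds_sub)

# A scatterplot matrix shows every pairwise relationship at once.
pairs(toy)
```

The parentheses in `-(20:60)` matter: `-20:60` would be read as the sequence from -20 to 60, which mixes negative and positive indices and causes an error.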