---
title: "CMPT 459.1-19. Programming Assignment 1"
subtitle: "FIFA 19 Players"
author: "Name - Student ID"
output: html_notebook
---
### Introduction
The data contains detailed attributes for every player registered in
the latest edition of the FIFA 19 database, obtained by scraping the
website “sofifa.com”. Each instance is a different player, and
the attributes give basic information about the players and
their football skills. Basic pre-processing was done, and
goalkeepers were removed for this assignment.
Please look here for the original data overview and attributes’
descriptions:
- https://www.kaggle.com/karangadiya/fifa19
And here to get a better view of the information:
- https://sofifa.com/
---
### First look
**[Task 1]**: Load the dataset, completing the code below (keep
the dataframe name as **fifa**)
```{r}
# Loading
fifa <- read.csv("fifa.csv")
```
**[Checkpoint 1]**: How many rows and columns exist?
```{r}
cat(ifelse(all(dim(fifa) == c(16122, 68)), "Correct results!",
"Wrong results.."))
```
---
**[Task 2]**: Give a very brief overview of the types of each
attribute and their values. **HINT**: Functions *str*, *table*,
*summary*.
```{r}
# Overview
str(fifa)
```
**[Checkpoint 2]**: Were functions used to display data types
and give some idea of the information of the attributes?
---
### Data Cleaning
Suggested functions for this part: *ifelse*, *substr*,
*nchar*, *str_split*, *map_dbl*.
Five attributes need to be cleaned.
- **Value**: Remove euro character, deal with ending
"K" (thousands) and "M" (millions), define missing values and
make it numeric.
- **Wage**: Same as above.
- **Release.Clause**: Same as above.
- **Height**: Convert to "cm" and make it numeric.
- **Weight**: Remove "lbs" and make it numeric.
**[Task 3]**: The first 3 of the 5 attributes listed above that
need to be cleaned are very alike. Create a single function that
cleans them all the same way. This function should take the vector of
attribute values as its parameter and return it cleaned; call it
three times, once for each column. **Encode zeroes or
blanks as NA.**
```{r}
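# Function used to clean attributes
library(stringr)
attr_fix <- function(attribute) {
  # Drop the euro sign and any surrounding whitespace
  x <- str_trim(str_remove(as.character(attribute), fixed("€")))
  # Multiplier implied by a trailing "K" (thousands) or "M" (millions)
  mult <- ifelse(str_detect(x, "M$"), 1e6, ifelse(str_detect(x, "K$"), 1e3, 1))
  # Strip the suffix and convert to numeric; blanks become NA
  cleaned_attribute <- as.numeric(str_remove(x, "[KM]$")) * mult
  # Encode zeroes as NA
  cleaned_attribute[!is.na(cleaned_attribute) & cleaned_attribute == 0] <- NA
  return(cleaned_attribute)
}
# Cleaning attributes
fifa$Value <- attr_fix(fifa$Value)
fifa$Wage <- attr_fix(fifa$Wage)
fifa$Release.Clause <- attr_fix(fifa$Release.Clause)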
```
**[Checkpoint 3]**: How many NA values?
```{r}
cat(ifelse(sum(is.na(fifa))==1779, "Correct results!", "Wrong results.."))
```
---
**[Task 4]**: Clean the other two attributes. **Hint**: To
convert to "cm" use http://www.sengpielaudio.com/calculatorbodylength.htm.
```{r}
# Cleaning attribute Weight:
```
```{r}
# Cleaning attribute Height:
```
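As a rough sketch of the two conversions, the example below assumes heights are stored as feet'inches strings such as `5'7` and weights as strings such as `159lbs`. Both formats are assumptions about the raw file, and the vectors `height_raw` and `weight_raw` are made-up toy data, not columns of "fifa".

```r
# Toy inputs; the feet'inches and "lbs" formats are assumptions
height_raw <- c("5'7", "6'2", "5'11")
weight_raw <- c("159lbs", "183lbs", "150lbs")

# Split "feet'inches", convert to total inches, then to cm (1 in = 2.54 cm)
parts <- strsplit(height_raw, "'", fixed = TRUE)
height_cm <- sapply(parts, function(p) {
  (as.numeric(p[1]) * 12 + as.numeric(p[2])) * 2.54
})

# Drop the "lbs" suffix and convert to numeric
weight_lbs <- as.numeric(sub("lbs", "", weight_raw, fixed = TRUE))
```

The same two steps could be written with *str_split* and *map_dbl* from the suggested functions; base R is used here only to keep the sketch free of package dependencies.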
**[Checkpoint 4]**: What are the mean values of these two
columns?
```{r}
cat(ifelse(all(c(round(mean(fifa[,8]),4)==164.1339,
round(mean(fifa[,7]),4)==180.3887)), "Correct results!", "Wrong results.."))
```
---
### Missing Values
**[Task 5]**: Which columns have missing values? List them below
(replace the empty bullets). Impute (do not remove) the missing
values, that is, every NA found, and explain the reasons for the
method used. Suggestion: MICE imputation based on random
forests, with the R package mice:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/. Use *set.seed(1)*.
**HINT**: Remember not to use "ID" or "International.Reputation" for the
imputation, if MAR (Missing at Random) is assumed. Also remember to
put them back into the "fifa" dataframe afterwards.
Columns with missing values:
-
-
- ...
```{r}
# Handling NA values
```
```{r}
# Putting columns not used on imputation back into "fifa" dataframe
```
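A minimal sketch of the suggested workflow on a made-up data frame follows. It assumes the mice package and a random-forest backend package for its `"rf"` method are installed; `toy`, `excluded` and `toy_complete` are hypothetical names, and `m = 1` (a single completed data set) is a simplification chosen for the illustration.

```r
library(mice)

set.seed(1)
# Toy data frame standing in for "fifa" (all values are made up)
toy <- data.frame(
  ID = 1:60,
  International.Reputation = sample(1:5, 60, replace = TRUE),
  Value = rnorm(60, mean = 100, sd = 15),
  Wage = rnorm(60, mean = 50, sd = 5),
  Age = sample(18:35, 60, replace = TRUE)
)
toy$Value[c(3, 10, 25)] <- NA
toy$Wage[c(7, 40)] <- NA

# Columns with at least one missing value
names(toy)[colSums(is.na(toy)) > 0]

# Leave the excluded columns out of the imputation model
excluded <- c("ID", "International.Reputation")
imp <- mice(toy[, !(names(toy) %in% excluded)],
            method = "rf", m = 1, seed = 1, printFlag = FALSE)

# Extract the completed data and put the excluded columns back
toy_complete <- cbind(toy[, excluded], mice::complete(imp, 1))
```

Note that `cbind` here places the excluded columns first; reorder the result if later code relies on column positions.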
**[Checkpoint 5]**: How many instances have at least one NA? It
should be 0 now. How many columns are there? It should be 68
(remember to put back "ID" and "International.Reputation").
```{r}
cat(ifelse(all(sum(is.na(fifa))==0, ncol(fifa)==68), "Correct results!", "Wrong results.."))
```
---
### Feature Engineering
**[Task 6]**: Create a new attribute called "Position.Rating"
that holds the rating value of the position corresponding to the
player. For example, if the player has the value "CF" in the
attribute "Position", then "Position.Rating" should hold the
number in the "CF" attribute. **After that, remove the
"Position" attribute from the data**.
```{r}
# Creating the attribute "Position.Rating"
```
```{r}
# Removing the attribute "Position"
```
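One way to look up a different column for each row is matrix indexing with (row, column) pairs. The sketch below uses a made-up three-player data frame with only three position columns; it illustrates the lookup idea and is not the full set of positions.

```r
# Toy data; column names mirror the assignment, values are made up
toy <- data.frame(
  Position = c("CF", "CB", "ST"),
  CF = c(70, 40, 65),
  ST = c(68, 38, 72),
  CB = c(30, 75, 35),
  stringsAsFactors = FALSE
)

# Matrix of the position-rating columns only
ratings <- as.matrix(toy[, c("CF", "ST", "CB")])

# For row i, pick the column whose name equals Position[i]
idx <- cbind(seq_len(nrow(toy)), match(toy$Position, colnames(ratings)))
toy$Position.Rating <- ratings[idx]

# Drop the "Position" column
toy$Position <- NULL
```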
**[Checkpoint 6]**: What's the mean of the "Position.Rating"
attribute created? How many columns are there in the dataframe?
It should be 68 (remember to remove "Position").
```{r}
cat(ifelse(all(c(round(mean(fifa$Position.Rating),5) ==
66.87067, ncol(fifa)==68)), "Correct results!", "Wrong results.."))
```
---
### Dimension Reduction
**[Task 7]**: Perform PCA (Principal Component Analysis) on the
columns representing ratings of positions (that is, attributes:
LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM,
RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB). Show the
summary of the components obtained. **Keep the minimum number of
components needed to have at least 98.50% of the variance explained by
them.** Remove the columns used for PCA. **HINT**: Function
*prcomp*, remember to center and scale.
```{r}
# Perform PCA
# Show Summary
```
```{r}
# Put the components back into "fifa" dataframe
# Remove original columns used for PCA
```
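The "minimum number of components for a variance threshold" step can be sketched as follows, on made-up correlated data. The object names `toy`, `toy.pca`, `cum_var`, `k` and `scores` are hypothetical; only *prcomp* and its `center`/`scale.` arguments come from the hint above.

```r
set.seed(1)
# Toy data: 6 correlated columns driven by 2 latent factors (made up)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  A = f1 + rnorm(n, sd = 0.1),
  B = f1 + rnorm(n, sd = 0.1),
  C = f1 + f2 + rnorm(n, sd = 0.1),
  D = f2 + rnorm(n, sd = 0.1),
  E = f2 + rnorm(n, sd = 0.1),
  G = f1 - f2 + rnorm(n, sd = 0.1)
)

# Centered and scaled PCA
toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)
summary(toy.pca)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(toy.pca$sdev^2) / sum(toy.pca$sdev^2)
k <- which(cum_var >= 0.985)[1]

# Keep the scores of the first k components
scores <- toy.pca$x[, seq_len(k), drop = FALSE]
```

The `scores` matrix is what would be bound back onto the data frame in place of the original columns.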
**[Checkpoint 7]**: How many columns exist in the dataset? It
should be 45.
```{r}
cat(ifelse(ncol(fifa)==45, "Correct results!", "Wrong results.."))
```
**[Bonus]**: Use the code below to see graphically which columns
contributed the most to each component. Replace "fifa.pca" with the
object returned by the *prcomp* function.
```{r}
library(factoextra)
fviz_pca_var(fifa.pca,
  col.var = "contrib", # Color by contributions to the PC
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE # Avoid text overlapping
)
```
---
### Binarization
**[Task 8]**: Perform binarization on the following categorical
attributes: "Preferred.Foot" and "Work.Rate". **HINT**: R
package "dummies", function *dummy.data.frame*.
```{r}
# Binarize categorical attributes
```
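To show what binarization produces, here is a base-R sketch that builds one 0/1 column per category level. It is a stand-in for illustration, not the *dummy.data.frame* call named in the hint; the helper `one_hot` and the toy values are hypothetical.

```r
# Toy data; values are made up
toy <- data.frame(
  Age = c(21, 30, 27),
  Preferred.Foot = c("Left", "Right", "Right"),
  Work.Rate = c("High/ Medium", "Medium/ Medium", "High/ Medium"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per level of a categorical column
one_hot <- function(df, col) {
  f <- factor(df[[col]])
  # Compare every value with every level; adding 0L turns TRUE/FALSE into 1/0
  out <- outer(as.character(f), levels(f), "==") + 0L
  colnames(out) <- paste0(col, levels(f))
  out
}

cat_cols <- c("Preferred.Foot", "Work.Rate")
dummy_cols <- do.call(cbind, lapply(cat_cols, one_hot, df = toy))

# Replace the categorical columns with their indicator columns
toy_bin <- cbind(toy[, !(names(toy) %in% cat_cols), drop = FALSE], dummy_cols)
```

Each categorical column with n levels is replaced by n indicator columns, which is why the column count grows after this step.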
**[Checkpoint 8]**: How many columns exist in the dataset? It
should be 54.
```{r}
cat(ifelse(ncol(fifa)==54, "Correct results!", "Wrong results.."))
```
---
### Normalization
**[Task 9]**: Remove attribute "ID" from the "fifa" dataframe, save
attribute "International.Reputation" in a vector named "IntRep",
and then also remove "International.Reputation" from the "fifa"
dataframe. Perform z-score normalization on "fifa", except for the
columns that came from PCA. Finally, combine the normalized
attributes with those from PCA and save the result in the "fifa" dataframe.
**HINT**: Function *scale*.
```{r}
# Normalize with Z-Score
```
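The steps above can be sketched on a made-up data frame as follows. Identifying the PCA columns by a `PC` name prefix is an assumption about how they were named when bound back; adjust the pattern to the actual names.

```r
# Toy data; "PC1" stands in for a column that came from PCA (values made up)
toy <- data.frame(
  ID = 1:5,
  International.Reputation = c(1, 3, 2, 1, 5),
  Age = c(21, 30, 27, 24, 33),
  Overall = c(70, 85, 78, 66, 90),
  PC1 = c(-1.2, 0.4, 0.1, -0.8, 1.5)
)

# Save the external label, then drop it together with "ID"
IntRep <- toy$International.Reputation
toy <- toy[, !(names(toy) %in% c("ID", "International.Reputation"))]

# Z-score everything except the PCA columns, then recombine
pc_cols <- grepl("^PC", names(toy))
toy <- cbind(as.data.frame(scale(toy[, !pc_cols, drop = FALSE])),
             toy[, pc_cols, drop = FALSE])

# Column means of the scaled attributes should be (numerically) zero
colMeans(toy[, !grepl("^PC", names(toy)), drop = FALSE])
```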
**[Checkpoint 9]**: How many columns exist in the dataset? It
should be 52. What's the mean of all the means of the
attributes? Should be around zero.
```{r}
cat(ifelse(ncol(fifa)==52, "Correct results!", "Wrong results.."))
```
---
### K-Means
**[Task 9]**: Perform K-Means for values of K ranging from 2 to 15.
Find the best number of clusters for K-means clustering,
based on the silhouette score. Report the best number of
clusters and the silhouette score for the corresponding
clustering, and state how strong the discovered cluster
structure is (fill in the "Results found" list below). Use
"set.seed(1)". **HINT**: Functions *kmeans* (make use of
parameters *nstart* and *iter.max*) and *silhouette* (from
package "cluster").
```{r}
# K-Means and Silhouette scores
```
Results found:
- Best number of clusters:
- Silhouette score:
- How strong is the cluster?
**[Checkpoint 9]**: Are there silhouette scores for K-Means with
K ranging from 2 to 15? Were the best K and the corresponding
silhouette score reported?
---
**[Task 10]**: Perform K-means with the K chosen and get the
resulting groups. Try out several pairs of attributes and
produce scatter plots of the clustering from task 9 for these
pairs of attributes. By inspecting these plots, determine a pair
of attributes for which the clusters are relatively well separated
and submit the corresponding scatter plot.
```{r}
# K-Means for best K and Plot
```
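The K-means selection and plotting steps of the last two tasks can be sketched on made-up data as below. The toy data set has three well-separated blobs, and only K from 2 to 6 is scanned to keep the example short (the task asks for 2 to 15); `toy`, `ks`, `avg_sil`, `best_k` and `km_best` are hypothetical names.

```r
library(cluster)

set.seed(1)
# Toy data: three well-separated 2-D blobs (made up)
toy <- rbind(
  cbind(rnorm(50, 0, 0.5), rnorm(50, 0, 0.5)),
  cbind(rnorm(50, 10, 0.5), rnorm(50, 0, 0.5)),
  cbind(rnorm(50, 5, 0.5), rnorm(50, 9, 0.5))
)
colnames(toy) <- c("x1", "x2")
d <- dist(toy)

# Average silhouette width for each candidate K
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25, iter.max = 100)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
best_k <- ks[which.max(avg_sil)]

# Refit with the chosen K and colour a scatter plot by cluster
km_best <- kmeans(toy, centers = best_k, nstart = 25, iter.max = 100)
plot(toy[, "x1"], toy[, "x2"], col = km_best$cluster, pch = 19,
     xlab = "x1", ylab = "x2")
```

On real data with many attributes, the same `plot` call would be repeated for different column pairs to find a pair in which the groups separate visibly.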
**[Checkpoint 10]**: Is there at least one plot showing two
attributes and the groups (colored or circled) reasonably
separated?
---
### Hierarchical Clustering
**[Task 11]**: Randomly sample 1% of the data (set.seed(1)).
Perform hierarchical cluster analysis on the sample using the
algorithms complete linkage, average linkage and single linkage.
Plot the dendrograms resulting from the different methods (the three
methods should be applied to the same 1% sample). Discuss the
commonalities and differences between the three dendrograms and
try to explain the reasons leading to the differences (fill in
the "Discussion" item below).
```{r}
# Sample and calculate distances
```
```{r}
# Complete
```
```{r}
# Average
```
```{r}
# Single
```
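A minimal sketch of the sampling and the three linkage runs, on made-up data standing in for the normalized data frame, could look like this. Euclidean distance (the default of *dist*) is an assumption, since the task does not name a distance measure.

```r
set.seed(1)
# Toy data standing in for the normalized "fifa" data frame (made up)
toy <- as.data.frame(matrix(rnorm(2000 * 4), ncol = 4))

# Draw a 1% random sample of the rows and compute Euclidean distances
idx <- sample(nrow(toy), size = round(0.01 * nrow(toy)))
d <- dist(toy[idx, ])

# Same distance matrix, three linkage criteria
hc_complete <- hclust(d, method = "complete")
hc_average <- hclust(d, method = "average")
hc_single <- hclust(d, method = "single")

plot(hc_complete, main = "Complete linkage")
plot(hc_average, main = "Average linkage")
plot(hc_single, main = "Single linkage")
```

The three criteria differ only in how the distance between two clusters is defined: the largest, the mean, or the smallest pairwise distance between their members.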
Discussion:
-
**[Checkpoint 11]**: Does the discussion show commonalities and
differences between the three dendrograms and explain the
differences?
---
### Clustering comparison
**[Task 12]**: Now perform hierarchical cluster analysis on the
**ENTIRE dataset** using the algorithms complete linkage,
average linkage and single linkage. Cut all of the three
dendrograms from task 11 to obtain a flat clustering with the
number of clusters determined as the best number in task 9.
To perform an external validation of the clustering results, use
the vector "IntRep" created earlier. What is the Rand Index for the
best K-means clustering? And what are the values of the Rand
Index for the flat clusterings obtained in this task from
complete linkage, average linkage and single linkage? Discuss the
results (fill in the "Discussion" item below). **HINT**: Function
*cluster_similarity* from package "clusteval".
```{r}
# Hierarchical Clusterings (Complete, Average and Single)
```
```{r}
# Flat Clusterings
```
```{r}
# Cluster Similarities
```
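To make the flat-clustering and Rand Index steps concrete, here is a base-R sketch on made-up data. The helper `rand_index` is a hypothetical hand-written implementation used for illustration in place of *cluster_similarity*; it counts, via the contingency table of the two label vectors, the pairs of points on which both labelings agree.

```r
# Hypothetical helper: Rand index between two label vectors,
# computed from pair counts in their contingency table
rand_index <- function(a, b) {
  tab <- table(a, b)
  total <- choose(length(a), 2)
  same_both <- sum(choose(tab, 2))         # pairs together in both labelings
  same_a <- sum(choose(rowSums(tab), 2))   # pairs together in a
  same_b <- sum(choose(colSums(tab), 2))   # pairs together in b
  # Agreements = pairs together in both + pairs apart in both
  (total + 2 * same_both - same_a - same_b) / total
}

set.seed(1)
# Toy data and toy external labels (made up)
toy <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 6), ncol = 2))
labels <- rep(1:2, each = 20)

# Cut a dendrogram into a flat clustering with k clusters
hc <- hclust(dist(toy), method = "complete")
flat <- cutree(hc, k = 2)

rand_index(flat, labels)
```

A value of 1 means the two labelings group every pair of points identically; the index does not depend on how the clusters are numbered.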
Discussion:
-
**[Checkpoint 12]**: Does the discussion include a relevant
comparison of the clusterings, and does it make sense?