STAT 7008 - Assignment 1
Due date: 5 Oct 2018
The use of numpy and pandas in this assignment is prohibited. You will
receive zero marks for any problem in this assignment that you solve
using these packages.
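With numpy and pandas ruled out, the standard library can handle the basic operations: csv for parsing, collections for grouping and statistics.median for medians. The following is a minimal sketch, not a model answer. The helper names (read_table, count_missing, median_by_group) are hypothetical, and it assumes that a missing value appears as an empty cell, which should be checked against the actual file.

```python
import csv
from collections import defaultdict
from statistics import median


def read_table(lines):
    """Split CSV input into a header list and a list of data rows.

    `lines` can be an open file object or any iterable of text lines.
    """
    reader = csv.reader(lines)
    header = next(reader)          # first row holds the variable names
    return header, list(reader)    # remaining rows hold the records


def count_missing(header, rows, missing_marker=""):
    """Count missing cells per variable (assumption: blanks mark missing data)."""
    counts = dict.fromkeys(header, 0)
    for row in rows:
        for name, cell in zip(header, row):
            if cell.strip() == missing_marker:
                counts[name] += 1
    return counts


def median_by_group(header, rows, group_col, value_col):
    """Median of one numeric column within each distinct value of group_col."""
    g = header.index(group_col)
    v = header.index(value_col)
    groups = defaultdict(list)
    for row in rows:
        if row[v].strip():         # skip missing cells
            groups[row[g]].append(float(row[v]))
    return {key: median(values) for key, values in groups.items()}
```

To use it on a file, pass an open file object, for example `with open("TrainingData.csv", newline="") as f: header, rows = read_table(f)`. Keeping rows as plain lists keeps the counting and grouping steps explicit, at the cost of looking up column positions by name.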
Question 1
1. Please write code to read the data file TrainingData.csv.
The first row is the header (variable names). Data are stored in
subsequent rows.
2. Determine the number of variables and the number of records in this
dataset.
3. Store the variable names in a list.
4. Determine whether there are any missing values in the data set. If yes,
please report the total number of missing values.
5. Find the number of distinct LCID values in the data set.
6. Find the variable with the most missing values.
7. Convert the variable hour_id to datetime format.
8. What is the time duration of the entire data set?
9. Determine the number of records per day.
10. Using the median function in the statistics package (from statistics
import median) or otherwise, do the following:
(a) Divide the entire data set by distinct value of LCID.
(b) For each distinct LCID value, determine the median of each
variable in the divided data set.
(c) Package the result in (b) in a dictionary.
11. Determine the number of Complaint cases and Non-complaint cases
in the entire data set.
12. Determine the top 10 LCIDs with the most complaint cases.
13. Calculate the median value of each variable per day in the entire data
set.
14. Use the first 5 digits of the LCID values to define a new variable Region.
15. Determine the region with the most complaint cases found in the data
set.
Question 2
The objective of this part is to employ the provided data sets u.data and
u.item to develop a movie recommender.
u.data consists of user ratings on a set of movies. The last column
corresponds to time stamps relative to 1st Jan 1970. Column names for
u.data are ["userid","movieid","rating","timestamp"].
u.item represents the set of movies defined in u.data. The column names for
u.item are ["movieid", "title", "release", "url", "unknown", "Action",
"Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary",
"Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
"Sci-Fi", "Thriller", "War", "Western"].
You can also download the two data files from
http://grouplens.org/datasets/movielens/.
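The column layout of u.data described above can be turned into a small line parser. This is a sketch under one assumption: the fields are tab-separated, as in the MovieLens 100K distribution, which should be confirmed against the provided file. The function name parse_rating_line is hypothetical. Note that datetime.fromtimestamp() returns a time in the local time zone.

```python
from datetime import datetime

# Column names for u.data, as given in the assignment text.
DATA_COLUMNS = ["userid", "movieid", "rating", "timestamp"]


def parse_rating_line(line, sep="\t"):
    """Turn one line of u.data into a dict keyed by DATA_COLUMNS.

    Assumption: fields are separated by `sep` (a tab by default).
    """
    fields = line.rstrip("\n").split(sep)
    record = dict(zip(DATA_COLUMNS, fields))
    record["rating"] = int(record["rating"])
    # The raw value is seconds since 1 Jan 1970; convert it to a datetime.
    record["timestamp"] = datetime.fromtimestamp(int(record["timestamp"]))
    return record
```

Applied to every line of the file, for example `[parse_rating_line(line) for line in open("u.data")]`, this gives a list of records that later steps can filter and group.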
1. Import the two data files with an appropriate separator. Do the
following:
(a) Set the timestamp variable to its datetime format using
datetime.fromtimestamp() method.
(b) Add leading zeros to the movieid and userid with zfill(4) method,
e.g. "0023".
2. Remove movies with title = 'unknown'.
3. Find the average ratings and the number of reviews for all movies in
u.item.
4. Write a function to list the top n (e.g. 10) rated movies, title names
and their number of reviews.
5. Considering that a movie with a higher number of reviews should be
given a higher weight, we adjust the average rating formula by
incorporating c hypothetical users. These users rate each movie with
rating m. Using c = 59 and m = 3, write a function to list the top n rated
movies, title names and their number of reviews using the adjusted
average formula. Compare the listing with that found in question 4.
Which one is more reasonable?
6. For two distinct users A and B, find the set of movies common to both
users, that is, the set of movies that both users have rated. Apply
the Euclidean distance formula on the two sets of ratings to determine
a "distance" between user A and user B. Write a distance function with
userid of A and userid of B as input. The output of the function is
1/(1+d(A,B)), where d(A,B) is the distance between user A and user B.
7. Given a user, write a function to determine and output a list of
distances between the given user and others. Mark the distances with
their users.
8. Write a function with a given user and a given movie. If the movie was
rated by the user, output the rating provided. If the movie was not
rated by the user, output the weighted average of the ratings of all
other users weighted by their distances with the given user.
9. Hence, given a user, write a function to suggest 10 movies.
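The adjusted average in step 5 and the similarity score in steps 6 to 8 can be sketched as follows. This is one possible reading, not a definitive solution. It assumes ratings are held in a nested dictionary of the form {userid: {movieid: rating}} (a hypothetical layout), that two users with no movie in common get a score of 0, and that the weighted average in step 8 uses only the other users who rated the movie, with the score 1/(1+d) as the weight. The function names are also hypothetical.

```python
from math import sqrt


def adjusted_average(movie_ratings, c=59, m=3):
    """Average after adding c hypothetical users who each give rating m."""
    return (sum(movie_ratings) + c * m) / (len(movie_ratings) + c)


def similarity(ratings, a, b):
    """Return 1 / (1 + d(A, B)) using Euclidean distance on common movies."""
    common = ratings[a].keys() & ratings[b].keys()
    if not common:
        return 0.0  # assumption: no shared movies means no similarity
    d = sqrt(sum((ratings[a][mv] - ratings[b][mv]) ** 2 for mv in common))
    return 1 / (1 + d)


def predict(ratings, user, movie):
    """Return the user's own rating, or a similarity-weighted average."""
    if movie in ratings[user]:
        return ratings[user][movie]
    weighted_sum = 0.0
    weight_total = 0.0
    for other, rated in ratings.items():
        if other == user or movie not in rated:
            continue  # assumption: only users who rated the movie count
        w = similarity(ratings, user, other)
        weighted_sum += w * rated[movie]
        weight_total += w
    # None signals that no other user provides usable information.
    return weighted_sum / weight_total if weight_total else None
```

A suggestion function for step 9 could call predict on every movie the user has not rated and keep the 10 highest values. With c = 59 and m = 3, a movie with a single rating of 5 has an adjusted average of 182/60, about 3.03, so sparsely reviewed movies are pulled towards m.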
