CSCI 4146 - The Process of Data Science - Fall 2020
Assignment 1
The submission must be done through Brightspace.
Due date and time as shown on Brightspace under Assignments.
● To prepare your assignment solution use the assignment template notebook available
on Brightspace.
● The detailed requirements for your writing and code can be found in the evaluation rubric
document on Brightspace.
● Questions will be marked individually with a letter grade. Their weights are shown in
parentheses after the question.
● Assignments can be done by a pair of students, or individually. If the submission is by a
pair of students, only one of the students should submit the assignment on Brightspace.
● We will use plagiarism tools to detect any type of cheating and copying (your code and
PDF).
● Your submission is a single Jupyter notebook and a PDF (with the compiled results
generated by your Jupyter notebook). File names should be:
○ A1--.ipynb
○ A1--.pdf
● Failing to submit both files results in a mark of 0 for both students.
In this assignment, you will need to build a model to predict the price of an Airbnb listing.
Dataset link: https://www.kaggle.com/airbnb/boston
1. Data understanding and preprocessing (0.1)
a. Build the data quality report
b. Identify data quality issues and build the data quality plan
c. Preprocess your data according to the data quality plan
d. Answer the following questions:
i. What is the neighbourhood with the highest average rating?
ii. What are the major characteristics of this neighbourhood (e.g., type of
listing, host rating, etc)?
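As a starting point for 1a and 1d, a data quality report can be sketched in plain Python. This is only a minimal illustration, not the expected solution: the column names (`neighbourhood`, `price`, `review_scores_rating`) and the sample rows are hypothetical, and treating empty strings as missing values is an assumption.

```python
import statistics
from collections import defaultdict

MISSING = (None, "")  # assumption: empty strings count as missing values


def quality_report(rows):
    """Summarise every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in MISSING]
        entry = {
            "count": len(present),
            "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
            "cardinality": len(set(present)),
        }
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if present and len(numeric) == len(present):
            # Continuous feature: add range and central tendency.
            entry.update(min=min(numeric), max=max(numeric),
                         mean=statistics.fmean(numeric),
                         median=statistics.median(numeric))
        elif present:
            # Categorical feature: add the most frequent level.
            entry["mode"] = statistics.mode(present)
        report[col] = entry
    return report


def mean_by_group(rows, group_key, value_key):
    """Average value_key per group, skipping rows where either is missing."""
    groups = defaultdict(list)
    for row in rows:
        group, value = row.get(group_key), row.get(value_key)
        if group not in MISSING and value not in MISSING:
            groups[group].append(value)
    return {group: statistics.fmean(vals) for group, vals in groups.items()}


# Hypothetical sample rows, for illustration only.
rows = [
    {"neighbourhood": "Back Bay", "price": 200.0, "review_scores_rating": 95},
    {"neighbourhood": "Back Bay", "price": 150.0, "review_scores_rating": None},
    {"neighbourhood": "Roxbury", "price": 80.0, "review_scores_rating": 88},
    {"neighbourhood": "", "price": 120.0, "review_scores_rating": 91},
]

report = quality_report(rows)
ratings = mean_by_group(rows, "neighbourhood", "review_scores_rating")
best = max(ratings, key=ratings.get)
```

Columns with a high `missing_pct` or an unexpected `cardinality` are natural candidates for the data quality plan in 1b.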
2. Spatial data (0.2)
a. Plot listings on the city map with different colours corresponding to the listing’s
neighbourhood
b. Mark the “State station” (lat, long = 42.3570174,-71.071191) subway station on
the city map.
c. Plot the distance between the closest and most distant listings to State station.
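For 2c (and the distance feature required in 3b), one common way to measure the distance from a listing to State station is the haversine great-circle formula. The sketch below assumes a spherical Earth with the mean radius shown; the listing tuples are hypothetical, and the function names are illustrative rather than part of any library.

```python
import math

STATE_STATION = (42.3570174, -71.071191)  # (lat, long) as given in question 2b
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_and_farthest(listings, origin=STATE_STATION):
    """Return ((distance_km, id) of closest, (distance_km, id) of farthest).

    listings is an iterable of (listing_id, lat, lon) tuples.
    """
    with_dist = [(haversine_km(origin[0], origin[1], lat, lon), listing_id)
                 for listing_id, lat, lon in listings]
    return min(with_dist), max(with_dist)


# Hypothetical listings, for illustration only.
listings = [
    ("a", 42.3601, -71.0589),
    ("b", 42.2800, -71.1300),
    ("c", 42.3570, -71.0700),
]
nearest, farthest = closest_and_farthest(listings)
```

At city scale the spherical approximation is accurate to well under one percent, which is usually sufficient for a distance feature.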
3. Build a model to forecast the price of a listing (0.7)
a. Explain what task you are solving (e.g., supervised vs. unsupervised,
classification vs. regression vs. clustering vs. similarity matching, etc.)
b. Use a feature selection method to select the features to build a model. Include in
the resulting dataset the distance from State station and exclude the free-text
(such as descriptions, reviews) and rating features.
c. Select the evaluation metric. Justify your choice.
d. Build a baseline model
i. Perform hyperparameter tuning if applicable.
ii. Train and evaluate your model
iii. How do you make sure not to overfit?
iv. Plot the learning curve
v. Analyze the results
e. Build a candidate final model (can be repeated for multiple models but only
include the final selection)
i. Perform hyperparameter tuning if applicable.
ii. Train and evaluate your model
iii. How do you make sure not to overfit?
iv. Plot the learning curve
v. Analyze the results
f. Compare the two models with a statistical significance test. Use a box-plot to
visualize your comparison.
g. The above question explicitly excludes the rating attributes. It corresponds to
modelling the task of a host putting a new listing on Airbnb that does not have
any ratings yet. A related task is of a traveller who wants to check whether a
listing requests a fair price, including the ratings of that listing. Include the
rating(s) in the dataset and see if the new final model performs better than
without the rating attributes.
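For 3f, the assignment does not prescribe a particular significance test. One possible choice, sketched below, is an exact paired sign-flip permutation test on per-fold scores obtained from the same cross-validation splits. The fold scores shown are hypothetical, and the function name is illustrative. Because cross-validation folds share training data, the resulting p-value should be read as approximate.

```python
from itertools import product
from statistics import fmean


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis, each paired difference is equally likely
    to be positive or negative, so every sign pattern is enumerated.
    Returns (mean_difference, p_value). Cost grows as 2**n, so this is
    only practical for a small number of folds.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same length)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(fmean(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(fmean([s * d for s, d in zip(signs, diffs)]))
        n_total += 1
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            n_extreme += 1
    return fmean(diffs), n_extreme / n_total


# Hypothetical per-fold RMSE values from 10-fold cross-validation.
baseline_rmse = [62.1, 58.4, 65.0, 60.2, 63.7, 59.9, 61.5, 64.3, 60.8, 62.6]
candidate_rmse = [55.3, 54.1, 57.2, 56.0, 58.9, 53.7, 55.8, 57.5, 54.9, 56.4]

mean_diff, p_value = paired_permutation_test(baseline_rmse, candidate_rmse)
```

The same two lists of per-fold scores can be passed to a box-plot function to produce the visual comparison requested in 3f, and the test can be rerun on the with-rating and without-rating models for 3g.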