BUSINESS SCHOOL
QBUS6810
Statistical Learning and Data Mining
Semester 1, 2021
Group Project: Airbnb Pricing Predictions
1. Key information
Required submissions:
• Team responsibilities outline (one pdf file per group; Canvas submission tool will be
made available in Week 9; due by the end of the day on May 17)
• Written report (one pdf file per group)
• Kaggle predictions (via www.kaggle.com, please see Section 5 for more information)
• Python code (one file per group).
Submission instructions for the report and the code will be posted on Canvas in Week 12.
Deadline for submitting the written report and the code is Friday, June 4 at 5PM.
Weight: 30% of your final grade.
Groups: Complete the assignment in groups of four or five students. Make sure to sign into
your group on Canvas: those groups will be used for identification and assessment purposes.
Length: Your written report should have a maximum of 15 pages (single spaced, 11pt font; the
cover page and references do not count towards the maximum).
Marking and key rules:
• A separately posted rubric indicates the marking criteria for the report.
• Please read the requirements for each part of the assignment carefully.
• Please follow any further instructions announced on Canvas, particularly for submissions.
• You must use Python for this assignment. It is OK to use Excel for data manipulation;
however, this approach is generally not recommended due to its inefficiency.
• The predictions on Kaggle must come from your own analysis in Python. An examination
of some of the code will be conducted for verification purposes.
2. Problem description
Airbnb (www.airbnb.com) is a global platform that runs an online marketplace for renting and
leasing short-term lodging. It is interested in developing a pricing service for its users that
will compute a recommended price based on the features of a listing. As a consultant working
for a data analytics company, you are approached by Airbnb to develop a model for predicting
nightly prices of Airbnb listings based on state-of-the-art techniques from statistical learning. The
focus of your analytics team is on the properties in Sydney, Australia.
You are provided with a training dataset containing detailed information on a number of
existing Airbnb listings in Sydney. As part of the contract, you are asked to write a report
according to the instructions given below. The client will use a test set to evaluate your work.
3. Understanding the data
A training dataset (train.csv) and a test dataset (test.csv) are posted on Canvas (the same files
are also posted on Kaggle). The test dataset omits the price values.
Data Description:
Each row corresponds to a separate Airbnb listing in Sydney. As a consequence of using real
data scraped from Airbnb, a detailed description of all the variables is not available. However,
the names of the variables are self-explanatory.
The first column in the data provides an identifier for each listing and is included to comply
with the Kaggle format. It should not be used as a predictor in the analysis. The response
variable, price, is the second column in the training dataset. It gives the price per night for
each listing in Australian Dollars (AUD). Variables security_deposit, cleaning_fee and
extra_people are also measured in AUD and correspond to surcharges. Variables latitude
and longitude specify the geographic location of each property. Several variables are
Boolean, with the word true recorded as “t” and false recorded as “f”.
Some of the listings have missing values under some of the variables. Note that in many cases
a missing value means that the corresponding characteristic does not apply to that particular
Airbnb listing. This is information, rather than lack of information, and you could make use of
this information in your analysis.
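For illustration only, the two points above (Boolean variables stored as "t"/"f", and missing values that carry information) can be sketched in pandas as follows. This is a minimal sketch on toy rows, not a required approach. The column name host_is_superhost and the helper names are hypothetical, and filling a missing surcharge with 0 assumes that "missing" means "no surcharge", which you should check against the data.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv. host_is_superhost is a hypothetical
# Boolean column; price, security_deposit and cleaning_fee are named in the
# data description.
toy = pd.DataFrame({
    "price": [120.0, 85.0, 300.0],
    "host_is_superhost": ["t", "f", None],
    "security_deposit": [200.0, np.nan, 500.0],
    "cleaning_fee": [50.0, np.nan, np.nan],
})


def encode_booleans(df, columns):
    """Map "t"/"f" strings to 1.0/0.0, leaving missing values as NaN."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map({"t": 1.0, "f": 0.0})
    return out


def add_missing_flags(df, columns, fill_value=0.0):
    """Add a 0/1 "<col>_missing" indicator, then fill the gaps.

    The indicator keeps the information that the value was absent;
    fill_value=0.0 is an assumption (no surcharge), not a rule.
    """
    out = df.copy()
    for col in columns:
        out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out


clean = encode_booleans(toy, ["host_is_superhost"])
clean = add_missing_flags(clean, ["security_deposit", "cleaning_fee"])
print(clean)
```

Keeping a separate indicator column lets a model distinguish "no cleaning fee recorded" from "cleaning fee of zero", which a plain fill would merge together.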
4. Written report
The purpose of the report is to describe, explain, and justify your solution to the client. You
can assume that the client is trained in business analytics but is not an expert in statistical
learning.
Suggested outline of the report:
1. Introduction: write a few paragraphs stating the business problem, summarising
your final solution, and highlighting your key insights. Use plain English and avoid
technical language as much as possible in this section (it should be for a wide audience).
2. Data processing and exploratory data analysis: provide key information about the data,
discuss potential issues, and highlight interesting and important facts about the data
and the relationships among the variables that are useful for the rest of your analysis.
3. Feature engineering: describe and justify your process of feature engineering.
4. Methodology (model building): here you will focus on the three models as outlined
below (your rationale for choosing the models and why they make sense for the data,
description of how these models are fitted, interpretations of the estimated models in
the context of the business problem at hand). The description of the methods and
algorithms can be more technical than the rest of the report (however, please use your
own words in the description).
5. Validation scores from Kaggle (see requirements below) and comparison of the models.
6. Conclusions and final remarks (non-technical).
Requirements:
• Your report must provide the validation scores (those from the Public Leaderboard on
Kaggle) for five different sets of predictions, including your final model. These should
generally be your best performing models within the model requirements specified below.
You will need to make a submission on Kaggle (see Section 5 for instructions) to get each
validation score.
• The five sets of predictions should come from different statistical learning methods. At
least one of the five models should be an interpretable linear model (OLS, Lasso, etc);
at least one should be an interpretable model specified by a single regression tree; at least
one should be an advanced tree-based model (bagging, random forests or boosting); and
at least one should be a model stack (or model average).
• In the methodology section you will discuss three of the five models in detail (including
both the description of the methods/algorithms and the interpretation of the estimated
models). The remaining two models do not need to be discussed in detail (you can just
provide one brief descriptive sentence for each of them).
• One of the three models that you discuss in detail must be your final model; one of the
three models is required to be an interpretable linear model (OLS, Lasso, etc); and one is
required to be an interpretable model specified by a single regression tree. Please note
that the description of the methods/algorithms for the three models should take up at
most 3 pages.
• You will pay special attention to and report on the relationship between the location and
the price, both during the exploratory data analysis and during the model interpretation.
You will comment on the patterns in pricing around Sydney and its constituent suburbs.
As part of feature engineering, you will create (and describe in the report) at least one new
location-related variable by using the existing variables and, if you wish, external
information.
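As one possible example of a location-related variable built from latitude and longitude, the sketch below computes the great-circle (haversine) distance from a listing to a reference point. It is only an illustration under stated assumptions: the reference coordinates are approximate values for the Sydney CBD, and the function and constant names are hypothetical. Other choices (distance to the nearest beach, suburb groupings, external data) may be more informative.

```python
import math

# Approximate latitude/longitude of the Sydney CBD (an assumed reference point).
SYDNEY_CBD = (-33.8688, 151.2093)
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_cbd_km(latitude, longitude):
    """Hypothetical new feature: distance from a listing to the assumed CBD point."""
    return haversine_km(latitude, longitude, *SYDNEY_CBD)


# Example with approximate coordinates for Bondi Beach.
print(round(distance_to_cbd_km(-33.8908, 151.2743), 1), "km")
```

With the training data loaded in a pandas DataFrame, the same function could be applied row by row to the latitude and longitude columns to create the new variable.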
5. Kaggle Competition
You will participate in the Kaggle competition that will be run on www.kaggle.com. This
competition will allow you to incorporate feedback into your model building process and
compare your performance with that of other groups. Participation in the competition is part
of the assessment, so please make sure that your final submission is correct. Your ranking in
the competition will typically not directly affect your marks (apart from the bonus marks and
the benchmark requirement, as explained below); however, we will assess whether your
participation represents a genuine effort to make good predictions and improve them (please
make sure to beat the “Benchmark” score on the Public Leaderboard).
You will need to create a Kaggle account, identifiable by your name, to access the competition
and make submissions. Please note that you can significantly simplify your registration with
Kaggle by using social logins (Facebook, Yahoo, Google) to sign in. Those options are available
on the Kaggle sign-in page. After you have created an account and logged into Kaggle, use
the following link to get to the competition page (you need to be logged in to get to the
competition page via the link):
https://www.kaggle.com/t/932020c58110783854baf5a0f6931377
On this page you will click on the “Join Competition” link, located in a dark box near the top
right corner of the page. After you accept the competition rules, you will have joined the
Kaggle competition for the group project.
Each group will need to create a team on Kaggle. The group leader can create a team by
joining the competition and then going into the “Team” tab, which will appear near the top of
the competition page. The leader can then invite other group members using their Kaggle
names (they need to first join the competition before they are able to be invited). Kaggle team
composition must be identical to that of the groups you formed on Canvas, and the team
number must match the group number. Each student in the group is required to sign up and
be identifiable as a member of a Kaggle team.
Kaggle randomly splits (just once) the listings in the test.csv file into validation (30%) and test
(70%) cases, but you will not know which ones are which. When you make a submission during
the Kaggle competition, you get a score equal to the RMSE computed on the validation
listings. These scores are displayed on the “Public Leaderboard” and provide an ongoing
ranking of teams. You can use the scores of your submissions to help you select the best
predictive model.
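The RMSE used for scoring can also be computed locally, for example on a hold-out portion of train.csv, so that you do not rely only on the Public Leaderboard. The sketch below is a plain-Python illustration; the function name and the toy numbers are hypothetical, and the result will not match the Kaggle score exactly because the validation listings are different.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy hold-out prices (AUD) and predictions from two hypothetical models.
observed = [100.0, 200.0, 300.0]
model_a = [110.0, 190.0, 300.0]
model_b = [150.0, 150.0, 350.0]

print("Model A RMSE:", rmse(observed, model_a))
print("Model B RMSE:", rmse(observed, model_b))
```

Because errors are squared before averaging, a few listings with very large prediction errors can dominate the score.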
You will need to manually select one of your Kaggle submissions to be used as final at the
end of the competition. Once the competition is over, Kaggle will rank teams’ final
submissions based on the test cases only, and those will be displayed on the “Private
Leaderboard”. Your goal is to do as well as possible on the Private Leaderboard at the end
of the competition, so please be careful not to overfit the validation cases in an attempt to
improve your public ranking. Please note that the competition ends at 4PM on June 4, which
is exactly 1 hour before the due time for the assignment report.
Real world relevance:
The ability to perform in a Kaggle competition is highly valued by employers. Some employers
go as far as to set up a Kaggle competition just for recruitment.
Bonus marks:
The five teams with the best performance on the Private Leaderboard will receive bonus
marks for the assignment (with the total Group Project score capped at 100). The best
performing team will receive 10 bonus marks, the second team will get 8 marks, the third will
get 6 marks, the fourth will get 4 marks, and the fifth will get 2 marks. Please note that your
choice of the final model must be
well justified in the report, and the corresponding Kaggle predictions must come from your
own analysis in Python. An examination of the code will be conducted for verification
purposes. Your code is required to reproduce the winning Kaggle predictions.