QBUS6810 Airbnb Pricing Predictions

BUSINESS SCHOOL

QBUS6810

Statistical Learning and Data Mining

Semester 1, 2021

Group Project: Airbnb Pricing Predictions

1. Key information

Required submissions:

• Team responsibilities outline (one pdf file per group; Canvas submission tool will be

made available in Week 9; due by the end of the day on May 17)

• Written report (one pdf file per group)

• Kaggle predictions (via www.kaggle.com, please see Section 5 for more information)

• Python code (one file per group).

Submission instructions for the report and the code will be posted on Canvas in Week 12.

Deadline for submitting the written report and the code is Friday, June 4 at 5PM.

Weight: 30% of your final grade.

Groups: Complete the assignment in groups of four or five students. Make sure to sign into

your group on Canvas: those groups will be used for identification and assessment purposes.

Length: Your written report should have a maximum of 15 pages (single spaced, 11pt, cover

page and references not counted towards the maximum).

Marking and key rules:

• A separately posted rubric indicates the marking criteria for the report.

• Please read the requirements for each part of the assignment carefully.

• Please follow any further instructions announced on Canvas, particularly for submissions.

• You must use Python for this assignment. It is OK to use Excel for data manipulation;

however, this approach is generally not recommended due to its inefficiency.

• The predictions on Kaggle must come from your own analysis in Python. An examination

of some of the code will be conducted for verification purposes.

2. Problem description

Airbnb (www.airbnb.com) is a global platform that runs an online marketplace for renting and

leasing short-term lodging. It is interested in developing a pricing service for its users that

will compute a recommended price based on the features of a listing. As a consultant working

for a data analytics company, you are approached by Airbnb to develop a model for predicting

nightly prices of Airbnb listings based on state-of-the-art techniques from statistical learning. The

focus of your analytics team is on the properties in Sydney, Australia.

You are provided with a training dataset containing detailed information on a number of

existing Airbnb listings in Sydney. As part of the contract, you are asked to write a report

according to the instructions given below. The client will use a test set to evaluate your work.

3. Understanding the data

A training dataset (train.csv) and a test dataset (test.csv) are posted on Canvas (the same files

are also posted on Kaggle). The test dataset omits the price values.

Data Description:

Each row corresponds to a separate Airbnb listing in Sydney. As a consequence of using real

data scraped from Airbnb, a detailed description of all the variables is not available. However,

the names of the variables are self-explanatory.

The first column in the data provides an identifier for each listing and is included to comply

with the Kaggle format. It should not be used as a predictor in the analysis. The response

variable, price, is the second column in the training dataset. It gives the price per night for

each listing in Australian Dollars (AUD). Variables security_deposit, cleaning_fee and

extra_people are also measured in AUD and correspond to surcharges. Variables latitude

and longitude specify the geographic location of each property. Several variables are

Boolean, with the word true recorded as “t” and false recorded as “f”.
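
As an illustration of the encoding described above, the following is a minimal sketch (not a prescribed method) of decoding the "t"/"f" columns and separating the identifier and response from the predictors. The sample rows, the identifier column name `Id`, and the Boolean column names are made up for the example; only `price`, `latitude` and `longitude` are named in this document.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for train.csv. The column names "Id",
# "host_is_superhost" and "instant_bookable" are assumptions for illustration.
sample_csv = io.StringIO(
    "Id,price,latitude,longitude,host_is_superhost,instant_bookable\n"
    "1,150,-33.87,151.21,t,f\n"
    "2,95,-33.89,151.27,f,f\n"
    "3,220,-33.80,151.29,,t\n"
)
df = pd.read_csv(sample_csv)


def decode_tf_columns(frame):
    """Map 't'/'f' strings to True/False in columns holding only those values.

    Missing entries are left as missing so they can be handled separately.
    """
    out = frame.copy()
    for col in out.columns:
        values = set(out[col].dropna().unique())
        if values and values <= {"t", "f"}:
            out[col] = out[col].map({"t": True, "f": False})
    return out


decoded = decode_tf_columns(df)

# The identifier is not a predictor and price is the response, so both are
# kept out of the predictor matrix.
response = decoded["price"]
predictors = decoded.drop(columns=["Id", "price"])
```

With the real data, `sample_csv` would be replaced by the path to train.csv, and the identifier column name would need to match the file.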

Some of the listings have missing values under some of the variables. Note that in many cases

a missing value means that the corresponding characteristic does not apply to that particular

Airbnb listing. This is information, rather than lack of information, and you could make use of

this information in your analysis.
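
One possible way to use missingness as information is sketched below: record a 0/1 indicator for each missing entry before filling it in. Filling with 0 assumes that a missing surcharge means no surcharge applies, which is an assumption to check against the data rather than a fact stated here. The helper name `add_missing_flags` and the sample values are hypothetical.

```python
import pandas as pd

# Made-up values for two of the surcharge variables named in this document.
listings = pd.DataFrame(
    {
        "cleaning_fee": [50.0, float("nan"), 80.0],
        "security_deposit": [float("nan"), float("nan"), 300.0],
    }
)


def add_missing_flags(frame, columns, fill_value=0.0):
    """Add a 0/1 '<column>_missing' indicator, then fill the missing entries.

    fill_value=0.0 encodes the assumption that a missing surcharge means the
    surcharge does not apply to that listing.
    """
    out = frame.copy()
    for col in columns:
        out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out


prepared = add_missing_flags(listings, ["cleaning_fee", "security_deposit"])
```

Keeping the indicator alongside the filled value lets a model distinguish "no cleaning fee" from "a cleaning fee of zero that was actually recorded".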

4. Written report

The purpose of the report is to describe, explain, and justify your solution to the client. You

can assume that the client is trained in business analytics but is not an expert in statistical

learning.

Suggested outline of the report:

1. Introduction: write a few paragraphs stating the business problem and summarising

your final solution and highlighting your key insights. Use plain English and avoid

technical language as much as possible in this section (it should be for a wide audience).

2. Data processing and exploratory data analysis: provide key information about the data,

discuss potential issues, and highlight interesting and important facts about the data

and the relationships among the variables that are useful for the rest of your analysis.

3. Feature engineering: describe and justify your process of feature engineering.

4. Methodology (model building): here you will focus on the three models as outlined

below (your rationale for choosing the models and why they make sense for the data,

description of how these models are fitted, interpretations of the estimated models in

the context of the business problem at hand). The description of the methods and

algorithms can be more technical than the rest of the report (however, please use your

own words in the description).

5. Validation scores from Kaggle (see requirements below) and comparison of the models.

6. Conclusions and final remarks (non-technical).

Requirements:

• Your report must provide the validation scores (those from the Public Leaderboard on

Kaggle) for five different sets of predictions, including your final model. These should

generally be your best performing models within the model requirements specified below.

You will need to make a submission on Kaggle (see Section 5 for instructions) to get each

validation score.

• The five sets of predictions should come from different statistical learning methods. At

least one of the five models should be an interpretable linear model (OLS, Lasso, etc.);

at least one should be an interpretable model specified by a single regression tree; at least

one should be an advanced tree-based model (bagging, random forests or boosting); and

at least one should be a model stack (or model average).

• In the methodology section you will discuss three of the five models in detail (including

both the description of the methods/algorithms and the interpretation of the estimated

models). The remaining two models do not need to be discussed in detail (you can just

provide one brief descriptive sentence for each of them).

• One of the three models that you discuss in detail must be your final model; one of the

three models is required to be an interpretable linear model (OLS, Lasso, etc.); and one is

required to be an interpretable model specified by a single regression tree. Please note

that the description of the methods/algorithms for the three models should take up at

most 3 pages.

• You will pay special attention to and report on the relationship between the location and

the price, both during the exploratory data analysis and during the model interpretation.

You will comment on the patterns in pricing around Sydney and its constituent suburbs.

As part of feature engineering, you will create (and describe in the report) at least one new

location-related variable by using the existing variables and, if you wish, external

information.
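
As one hypothetical example of a location-related variable built from the existing latitude and longitude, the sketch below computes the great-circle (haversine) distance from a listing to a reference point. The reference coordinates are an approximate, commonly quoted location for central Sydney and are an assumption of this example, as are the function names; groups are free to choose a different variable or reference point.

```python
import math

# Approximate coordinates for central Sydney (an assumption for illustration).
CBD_LAT, CBD_LON = -33.8688, 151.2093
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_cbd_km(lat, lon):
    """Distance from a listing to the assumed reference point, in kilometres."""
    return haversine_km(lat, lon, CBD_LAT, CBD_LON)
```

Applied row by row to the latitude and longitude columns, this yields a single numeric feature that can be plotted against price during the exploratory analysis.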

5. Kaggle Competition

You will participate in the Kaggle competition that will be run on www.kaggle.com. This

competition will allow you to incorporate feedback into your model building process and

compare your performance with that of other groups. Participation in the competition is part

of the assessment, so please make sure that your final submission is correct. Your ranking in

the competition will typically not directly affect your marks (apart from the bonus marks and

the benchmark requirement, as explained below). However, we will assess whether your

participation represents a genuine effort to make good predictions and improve them (please

make sure to beat the “Benchmark” score on the Public Leaderboard).

You will need to create a Kaggle account, identifiable by your name, to access the competition

and make submissions. Please note that you can significantly simplify your registration with

Kaggle by using social logins (Facebook, Yahoo, Google) to sign in. Those options are available

on the Kaggle sign-in page. After you have created an account and logged into Kaggle, use

the following link to get to the competition page (you need to be logged in to get to the

competition page via the link):

https://www.kaggle.com/t/932020c58110783854baf5a0f6931377

On this page you will click on the “Join Competition” link, located in a dark box near the top

right corner of the page. After you accept the competition rules, you will have joined the

Kaggle competition for the group project.

Each group will need to create a team on Kaggle. The group leader can create a team by

joining the competition and then going into the “Team” tab, which will appear near the top of

the competition page. The leader can then invite other group members using their Kaggle

names (they need to first join the competition before they are able to be invited). Kaggle team

composition must be identical to that of the groups you formed on Canvas, and the team

number must match the group number. Each student in the group is required to sign up and

be identifiable as a member of a Kaggle team.

Kaggle randomly splits (just once) the listings in the test.csv file into validation (30%) and test

(70%) cases, but you will not know which ones are which. When you make a submission during

the Kaggle competition, you get a score equal to the RMSE computed on the validation

listings. These scores are displayed on the “Public Leaderboard” and provide an ongoing

ranking of teams. You can use the scores of your submissions to help you select the best

predictive model.
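
For reference, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted prices. The sketch below shows the standard definition in plain Python so that groups can compute the same kind of score on their own hold-out data. It is not Kaggle's code, and the function name and example values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of prices."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two made-up listings, each mispredicted by 10 AUD.
example_score = rmse([100.0, 200.0], [110.0, 190.0])
```

Because the errors are squared before averaging, a few badly mispredicted listings can dominate the score.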

You will need to manually select one of your Kaggle submissions to be used as final at the

end of the competition. Once the competition is over, Kaggle will rank teams’ final

submissions based on the test cases only, and those will be displayed on the “Private

Leaderboard”. Your goal is to do as well as possible on the Private Leaderboard at the end

of the competition, so please be careful not to overfit the validation cases in an attempt to

improve your public ranking. Please note that the competition ends at 4PM on June 4, which

is exactly 1 hour before the due time for the assignment report.

Real world relevance:

The ability to perform in a Kaggle competition is highly valued by employers. Some employers

go as far as to set up a Kaggle competition just for recruitment.

Bonus marks:

The five teams with the best performance on the Private Leaderboard will receive bonus

marks for the assignment (with the total Group Project score capped at 100). The best

performing team will receive 10 bonus marks, the second team will get 8 marks, the third will

get 6 marks, the fourth will get 4 marks, and the fifth will get 2 marks (however, the maximum

score will remain at or below 100). Please note that your choice of the final model must be

well justified in the report, and the corresponding Kaggle predictions must come from your

own analysis in Python. An examination of the code will be conducted for verification

purposes. Your code is required to reproduce the winning Kaggle predictions.
