CVEN9407-Transport Modelling

Project Brief
Introduction 
This document explains the final project of CVEN9407. The project is individual work, and group 
submissions are not accepted. Its purpose is to familiarise students with practical 
econometric analysis and to guide them in drawing statistical inferences. The project is worth 
50% of the final grade. Students are evaluated on their submitted progress report and their 
final report. This brief covers the data, the recommended software, the process of data analysis and 
model development, the format of the reports, and the submission dates. 
Guidance and assistance 
Students are advised to self-monitor their progress on the project and seek assistance if needed. 
Students can gauge their performance based on their progress report feedback. 
Students can use their workshop hours to discuss issues with the course demonstrator. If further 
assistance is needed, students can ask for consultation with the course coordinator. 
Software
To accomplish this project, a statistical software package is required. R is the recommended 
package for this project. R is a free statistical 
package which can be downloaded from this website. To make R easier to use, it is recommended to 
download RStudio as well. RStudio can be downloaded from this website.
A basic introduction to the software will be provided in the lectures, and sample code for completing 
most of the workshop questions will be provided to students. 
Note that using R is not mandatory, and students can work with other statistical software packages 
if they wish. 
 
Data
The dataset of this study is obtained from the Household, Income and Labour Dynamics in 
Australia (HILDA) survey. HILDA is a longitudinal survey which started in 2001 and is planned to continue 
until 2021 (for more information about this survey refer to this website). HILDA contains 
socio-demographic information about respondents. Moreover, it contains respondents’ ratings of their satisfaction in 
different domains. The main purpose of this study is to investigate the impact of transport related 
variables on life satisfaction. HILDA is a confidential dataset, and students must submit a request to 
DataVerse to obtain it. A separate document will be uploaded to Moodle to guide you on how to 
get access to HILDA.
*** The very first step of this project is obtaining access to HILDA ***
After you obtain access to HILDA, the dataset of this project will be shared with you. Due to 
confidentiality issues, all personal information has been removed from this dataset. 
Every student is supposed to focus on one aspect of life satisfaction in a specific year. To obtain 
your personalised dataset for this project, filter the dataset that is shared with you based on the 
allocated “year” and “variable of interest” given in a separate table. This table is posted on the 
Moodle page. 
You must keep only one of the 9 life satisfaction variables in your dataset, which is going 
to be the dependent variable of your study. Note that the other life satisfaction variables should not be 
used as independent variables. 
The variable of interest in this table is your dependent variable in this project; throughout the 
project, the potential impact of the other explanatory variables on this variable will be investigated. 
Variables definition
The definition of most of the variables is provided here. However, some of the fields in the 
processed dataset do not exist in the HILDA Data Dictionary. Below you can find the definitions of 
these variables. 
The last 40 variables in this list are the land use variables of individuals’ residences. Four 
indexes are available which describe the socio-demographic condition of zones. These indexes are 
generated by the Australian Bureau of Statistics and are referred to as Socio-Economic Indexes for Areas 
(SEIFA). SEIFA variables include:
• The Index of Relative Socio-Economic Disadvantage (IRSD)
• The Index of Relative Socio-Economic Advantage and Disadvantage (IRSAD)
• The Index of Education and Occupation (IEO)
• The Index of Economic Resources (IER).
For more information please visit this webpage. 
Variable Definition
Female Binary variable indicating gender (female =1)
Married Binary variable indicating marital status (married =1)
ESL Binary variable indicating if English is the second language
Le_mar Binary variable indicating if the individual has experienced the life event of marriage last year
Le_sep Binary variable indicating if the individual has experienced the life event of separation last year
Le_job Binary variable indicating if the individual has experienced the life event of job change last year
Le_bth Binary variable indicating if the individual has experienced the life event of giving birth to a child last year
Le_prg Binary variable indicating if the individual has experienced the life event of becoming pregnant last year
Le_death Binary variable indicating if the individual has experienced the life event of death of spouse/child/close friend/relative last year
Le_fni Binary variable indicating if the individual has experienced major improvement in finances last year
Le_fnw Binary variable indicating if the individual has experienced worsening in finances last year
Le_frd Binary variable indicating if the individual has been fired or made redundant last year
Le_prm Binary variable indicating if the individual has been promoted last year
Le_rtr Binary variable indicating if the individual has retired last year
Le_ins Binary variable indicating if the individual had a serious personal injury last year
Mltpljob Binary variable indicating if the individual is employed in multiple jobs
Manager Binary variable indicating if the job type is managerial
Professional Binary variable indicating if the job type is professional
Technician Binary variable indicating if the job type is technician
ServiceWorker Binary variable indicating if the job type is service work
Administrative Binary variable indicating if the job type is administrative
SalesWorker Binary variable indicating if the job type is sales worker
MachineryOperator Binary variable indicating if the job type is machinery operator
Labour Binary variable indicating if the job type is labourer
FlxWork Binary variable indicating if the individual has flexible working hours
HmWork Binary variable indicating if the individual can work from home
PrtStudy Binary variable indicating if the individual is doing part time studies
FullStudy Binary variable indicating if the individual is doing full time studies
Postgrad Binary variable indicating education level (postgraduate =1)
Bachelor Binary variable indicating education level (Bachelor=1)
CoupleWo Binary variable indicating if family structure is couple without children
CoupleW Binary variable indicating if family structure is couple with children
LoneW Binary variable indicating if family structure is single parent
Single Binary variable indicating if family structure is single person
Renter Binary variable indicating if the individual is renting his/her living place
hhad10_1 Binary variable indicating if the 'IRSAD' index of the home zone is less than 1
hhad10_2 Binary variable indicating if the 'IRSAD' index of the home zone is less than 2
hhad10_3 Binary variable indicating if the 'IRSAD' index of the home zone is less than 3
hhad10_4 Binary variable indicating if the 'IRSAD' index of the home zone is less than 4
hhad10_5 Binary variable indicating if the 'IRSAD' index of the home zone is less than 5
hhad10_6 Binary variable indicating if the 'IRSAD' index of the home zone is less than 6
hhad10_7 Binary variable indicating if the 'IRSAD' index of the home zone is less than 7
hhad10_8 Binary variable indicating if the 'IRSAD' index of the home zone is less than 8
hhad10_9 Binary variable indicating if the 'IRSAD' index of the home zone is less than 9
hhda10_1 Binary variable indicating if the 'IRSD' index of the home zone is less than 1
hhda10_2 Binary variable indicating if the 'IRSD' index of the home zone is less than 2
hhda10_3 Binary variable indicating if the 'IRSD' index of the home zone is less than 3
hhda10_4 Binary variable indicating if the 'IRSD' index of the home zone is less than 4
hhda10_5 Binary variable indicating if the 'IRSD' index of the home zone is less than 5
hhda10_6 Binary variable indicating if the 'IRSD' index of the home zone is less than 6
hhda10_7 Binary variable indicating if the 'IRSD' index of the home zone is less than 7
hhda10_8 Binary variable indicating if the 'IRSD' index of the home zone is less than 8
hhda10_9 Binary variable indicating if the 'IRSD' index of the home zone is less than 9
hhec10_1 Binary variable indicating if the 'IER' index of the home zone is less than 1
hhec10_2 Binary variable indicating if the 'IER' index of the home zone is less than 2
hhec10_3 Binary variable indicating if the 'IER' index of the home zone is less than 3
hhec10_4 Binary variable indicating if the 'IER' index of the home zone is less than 4
hhec10_5 Binary variable indicating if the 'IER' index of the home zone is less than 5
hhec10_6 Binary variable indicating if the 'IER' index of the home zone is less than 6
hhec10_7 Binary variable indicating if the 'IER' index of the home zone is less than 7
hhec10_8 Binary variable indicating if the 'IER' index of the home zone is less than 8
hhec10_9 Binary variable indicating if the 'IER' index of the home zone is less than 9
hhed10_1 Binary variable indicating if the 'IEO' index of the home zone is less than 1
hhed10_2 Binary variable indicating if the 'IEO' index of the home zone is less than 2
hhed10_3 Binary variable indicating if the 'IEO' index of the home zone is less than 3
hhed10_4 Binary variable indicating if the 'IEO' index of the home zone is less than 4
hhed10_5 Binary variable indicating if the 'IEO' index of the home zone is less than 5
hhed10_6 Binary variable indicating if the 'IEO' index of the home zone is less than 6
hhed10_7 Binary variable indicating if the 'IEO' index of the home zone is less than 7
hhed10_8 Binary variable indicating if the 'IEO' index of the home zone is less than 8
hhed10_9 Binary variable indicating if the 'IEO' index of the home zone is less than 9
Analysis 
1. Data analysis 
1.1. The first step is to familiarise yourself with the data. For that purpose:
• Check the definition of variables
• Check for any missing values in the data
• Calculate the mean and the standard deviations of continuous variables
• Calculate the frequencies for discrete variables
• If needed, plot the data to see the variations in variables 
• Check the range of variables and see if it makes sense to you
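The checks in step 1.1 can be sketched as follows. This is only an illustration in Python with the standard library (the brief allows packages other than R); the records are hypothetical toy values, and only the column names lscom and Female come from the variable lists in this brief.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy records; None marks a missing value.
records = [
    {"lscom": 5.0, "Female": 1},
    {"lscom": 2.5, "Female": 0},
    {"lscom": None, "Female": 1},
    {"lscom": 7.5, "Female": 1},
]

def summarise_continuous(rows, name):
    """Missing count, mean, standard deviation and range of one column."""
    values = [r[name] for r in rows if r[name] is not None]
    return {
        "missing": len(rows) - len(values),
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }

def frequencies(rows, name):
    """Count how often each level of a discrete column occurs."""
    return Counter(r[name] for r in rows if r[name] is not None)

lscom_summary = summarise_continuous(records, "lscom")
female_counts = frequencies(records, "Female")
print(lscom_summary)
print(female_counts)
```

Comparing "min" and "max" with the documented range of each variable is one simple way to spot implausible values.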
1.2. The relationship between variables 
• Calculate the correlation matrix for available variables 
• Highlight the strong correlations in the matrix
• Justify your observation. Explain potential reasons behind strong correlations.
• Are there cases where you expect to see strong correlations, but the data show 
otherwise? Discuss these cases.
• Is there any variable that you expect to have a non-linear relationship with the 
dependent variable? If you are not sure, plot the dependent variable against it and see 
if you can detect any pattern.
• For the variables that you suspect have a non-linear relationship, define new 
independent variables with an appropriate transformation (logarithmic, exponential, 
second or third power, etc.).
• Include the new independent variables in the correlation matrix and discuss the 
results.
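As a rough illustration of step 1.2, the sketch below computes a Pearson correlation matrix, adds a log-transformed variable, and flags strong pairs. The data are hypothetical, the 0.7 cut-off for "strong" is an assumption rather than a rule from this brief, and the names satisfaction and log_hxymvfi are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data: satisfaction rises with the logarithm of fuel spending.
fuel = [100.0, 200.0, 400.0, 800.0, 1600.0]
data = {
    "satisfaction": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hxymvfi": fuel,
    "log_hxymvfi": [math.log(v) for v in fuel],  # new transformed variable
}

corr = correlation_matrix(data)

# Flag strong correlations; the 0.7 threshold is an assumed cut-off.
strong = [(a, b, round(corr[a][b], 3))
          for a in data for b in data
          if a < b and abs(corr[a][b]) >= 0.7]
print(strong)
```

In this toy case the transformed variable correlates more strongly with the dependent variable than the raw one, which is the kind of pattern the transformation step is meant to reveal.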
2. Regression analysis
2.1. It is always recommended to divide the dataset into train and test sub-datasets. The train 
dataset, containing 80 percent of the records, is used to estimate the parameters of the model, 
and the test dataset, containing the remaining 20 percent, is used to validate the model. 
• Use the sample() function in R to randomly divide the dataset into test and train datasets. 
Even if you choose to use another statistical package for this project, this step should be 
completed using R (this is because the marker will be using R to check your analysis). 
• To make the random selection reproducible, set the seed number to your student ID.
This way, although you randomly divide the data into test and train sub-datasets, 
the process can be repeated. The command in R to fix the seed number is 
set.seed().
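The brief requires this step to be done in R with set.seed() and sample(). The Python sketch below only illustrates the logic of a seeded, repeatable 80/20 split; it is not a substitute, because Python's random number generator differs from R's and the same seed gives a different partition. The student ID and the record count are hypothetical.

```python
import random

def train_test_split(n_records, seed, train_share=0.8):
    """Return (train_indices, test_indices) for a reproducible random split."""
    rng = random.Random(seed)          # fixed seed makes the shuffle repeatable
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_train = round(train_share * n_records)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

STUDENT_ID = 1234567  # hypothetical student ID used as the seed

train_idx, test_idx = train_test_split(100, STUDENT_ID)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, which is what allows a marker to reproduce the split.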
2.2. Selecting the set of explanatory variables to be included in the model
• The main purpose of this study is to examine the relationship between transport 
related variables and the level of satisfaction. The dependent variable is the level of 
satisfaction, and the remaining variables form the set of independent variables. 
• The available transport related variables in this study are:
o lscom: Travel time to/from paid work per week
o hxymvfi: Household annual expenditure on motor vehicle fuel ($)
o hxymvri: Household annual expenditure on motor vehicle repairs/maintenance ($)
o hxyncri: Household annual expenditure on new motor vehicles, motorbikes or other vehicles
o hxypbti: Household annual expenditure on public transport and taxis
• For each variable, run a separate regression model with only that variable and discuss 
the estimated coefficient. 
• For each of the 31 combinations of transport related variables (every non-empty subset of 
the five variables, 2^5 - 1 = 31), run a regression model and 
select the best model. The best model has the highest goodness-of-fit, while all the 
included variables are statistically significant. 
• Use the forward stepwise method to add other independent variables to the model. 
o Use the Bayesian Information Criterion (BIC) as the improvement criterion in the 
stepwise method. In each step, add one variable to the model. This variable 
should be statistically significant and improve BIC the most.
o Continue the process until either all the variables are exhausted or none of the 
remaining variables can improve BIC any further.
• The model that you have developed so far comes from a mechanical process in which
theory did not play a role. At this stage you should examine the model to see if it is consistent with 
existing theories in the field. There are two issues to take into consideration. First,
exploring the theories on life satisfaction is outside the scope of this subject. So, as a 
simplification, we rely only on common sense (note that in a real project the 
reference must be accepted theories). Second, from this point the process becomes 
somewhat subjective. In the previous steps, BIC and adjusted R-squared could help you 
select the best model and modify it. From this point on, however, 
you need to use your judgment to decide how much goodness-of-fit can be 
compromised to include or exclude variables based on your expectations (or theories). 
Different modellers have different judgments and different approaches to 
implementing their opinions. So, get ready to grow your own modelling judgment. 
o Justify the included variables and the signs of their coefficients. Is there any 
variable that you cannot justify, or whose sign is counterintuitive? 
o On the other hand, is there any remaining variable which you expected to 
be included in your model? 
o Improve your model by putting aside unreasonable variables and including new 
variables from the leftovers that you expected to be included. Most likely, this 
practice deteriorates the model goodness-of-fit. This is where you should decide 
how much you are willing to compromise the goodness-of-fit to improve 
justifiability.
o Note that in this study you want to investigate the relationship between 
transport related variables and the level of satisfaction. So transport related variables 
should have a higher priority to be included in the model. 
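The mechanical part of step 2.2 can be sketched as below: enumerating the 31 subsets of the transport related variables, and a forward stepwise search that adds the variable giving the lowest BIC. This is a minimal sketch under stated assumptions: the OLS fit uses hand-written normal equations, BIC is computed as n*ln(RSS/n) + k*ln(n) (the Gaussian form, up to a constant), the significance check on each added variable is omitted, and the toy data are hypothetical.

```python
import math
from itertools import combinations

# Transport related variables named in the brief.
TRANSPORT = ["lscom", "hxymvfi", "hxymvri", "hxyncri", "hxypbti"]
# Every non-empty subset of the five variables: 2**5 - 1 = 31 models.
all_subsets = [c for r in range(1, 6) for c in combinations(TRANSPORT, r)]

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit with intercept (normal equations)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented system [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
               for r in range(n))

def bic(rss, n, k):
    """Gaussian BIC up to an additive constant; lower is better."""
    return n * math.log(rss / n) + k * math.log(n)

def forward_stepwise(candidates, y, start=()):
    """Add, one at a time, the candidate that lowers BIC the most."""
    selected, n = list(start), len(y)

    def score(names):
        return bic(ols_rss([candidates[v] for v in names], y), n, len(names) + 1)

    best = score(selected)
    while True:
        trials = [(score(selected + [v]), v) for v in candidates if v not in selected]
        if not trials or min(trials)[0] >= best:
            return selected, best   # nothing left, or no further improvement
        best, chosen = min(trials)
        selected.append(chosen)

# Hypothetical toy data: y depends on lscom only.
lscom = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
renter = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [2 + 3 * x + e for x, e in zip(lscom, noise)]

selected, best_bic = forward_stepwise({"lscom": lscom, "Renter": renter}, y)
print(len(all_subsets), selected, round(best_bic, 3))
```

The start argument lets the search begin from the best transport-only model rather than from an empty model. Recording the BIC at each step also gives the data for a plot of BIC against the number of included variables.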
2.3. Testing the assumptions of the Classical Linear Regression Model (CLRM)
• List all the assumptions behind CLRM and the statistical tests that you prefer to use to 
validate them. 
• Test your model to see if it satisfies all the assumptions. 
• If your model does not satisfy one, or some of the assumptions, double check the set 
of your independent variables. Sometimes excluding unnecessary variables solves the 
issue. 
• If the problem still exists, use standard methods to rectify the problem. 
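The choice of tests in step 2.3 is left to the student. As one example only, the sketch below computes the Jarque-Bera statistic, a common check of the normality of residuals; it says nothing about the other CLRM assumptions. The residuals are hypothetical, and the p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, which can be inaccurate in small samples.

```python
import math

def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a list of residuals."""
    n = len(residuals)
    m = sum(residuals) / n
    m2 = sum((r - m) ** 2 for r in residuals) / n   # second central moment
    m3 = sum((r - m) ** 3 for r in residuals) / n   # third central moment
    m4 = sum((r - m) ** 4 for r in residuals) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x/2).
    p_value = math.exp(-jb / 2)
    return jb, p_value

# Hypothetical residuals from a fitted regression model.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb, p_value = jarque_bera(residuals)
print(round(jb, 4), round(p_value, 4))
```

A small p-value would suggest that the residuals deviate from normality.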
2.4. Validation.
• To validate the accuracy of the model, simulate the dependent variable for the test 
dataset and compare the results with the observed values. Discuss the model 
prediction ability. 
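One way to compare predictions with observed values on the test dataset is through summary error measures. The sketch below computes the root mean squared error and the mean absolute error; the choice of these two measures is an assumption, not a requirement of this brief, and the values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

# Hypothetical satisfaction scores in the test dataset and model predictions.
observed = [7.0, 8.0, 5.0, 9.0]
predicted = [6.5, 8.5, 6.0, 8.0]

test_rmse = rmse(observed, predicted)
test_mae = mae(observed, predicted)
print(round(test_rmse, 4), round(test_mae, 4))
```

Computing the same measures on the train dataset gives a reference point: a much larger error on the test dataset may indicate overfitting.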
2.5. Regarding your report: as you can see, there is a long process behind developing a regression 
model. However, you do not need to report all the work you have done. Think about what would 
be interesting for readers to learn from your work and how to efficiently convey the 
highlights of your study. For instance, you can provide a plot of the BIC variations in step 2.2, 
which summarises the stepwise process. Your report should include the final model, which 
satisfies all the CLRM assumptions, and your justification for the coefficients and their signs.
3. Discrete choice analysis 
3.1. Selecting the right model specification
• The first step in developing a discrete choice model is deciding about the model 
specification. The initial decision on model specification is mainly based on the 
dependent variable. Note that this decision might change along the way. 
3.2. Defining choices and setting up the utility functions
• Discrete choice models, as the name implies, are developed to model the outcome of 
selecting one option out of multiple available alternatives. The output of discrete 
choice models is the probability of selecting each of the alternatives. In this study, we
extend the application of discrete choice models to probability of belonging to a 
category, rather than selecting a category. In fact, in our study people do not make a 
decision about their level of satisfaction, but they feel belonging to a certain category. 
Although it does not resemble a choice setup, by modifying our definition of utility 
function we can still use discrete choice models for this context.
• To simplify the model, aggregate the range of your dependent variable into three 
categories: unsatisfied, moderate, and satisfied. The dependent variable varies from 
0 to 10. Assume that values below 5 indicate dissatisfaction and values above 7 
indicate complete satisfaction, so that values from 5 to 7 are moderate. Based on this 
assumption, define a new dependent variable with three levels. Then calculate the 
“market share” of each of the categories for the test dataset, the train dataset and overall. 
• Based on the nature of available independent variables, discuss your alternative 
specific variables and generic variables, then derive a mathematical formulation for the 
utility functions. 
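The aggregation into three levels and the market share calculation can be sketched as follows. The thresholds follow the assumption stated above (below 5 is unsatisfied, above 7 is satisfied, 5 to 7 is moderate); the scores are hypothetical.

```python
from collections import Counter

LEVELS = ["unsatisfied", "moderate", "satisfied"]

def satisfaction_level(score):
    """Map a 0-10 satisfaction score to one of three levels."""
    if score < 5:
        return "unsatisfied"
    if score > 7:
        return "satisfied"
    return "moderate"   # scores from 5 to 7 inclusive

def market_shares(scores):
    """Share of records falling into each satisfaction level."""
    counts = Counter(satisfaction_level(s) for s in scores)
    return {level: counts[level] / len(scores) for level in LEVELS}

# Hypothetical satisfaction scores for the train and test datasets.
train_scores = [3, 6, 8, 9, 7, 10, 4, 8]
test_scores = [5, 9]

train_shares = market_shares(train_scores)
test_shares = market_shares(test_scores)
overall_shares = market_shares(train_scores + test_scores)
print(train_shares, test_shares, overall_shares)
```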
3.3. Estimating the parameters of the model
• According to the selected model specification and the defined utility function, run a 
discrete choice model with the same set of independent variables that you ended up with 
in your regression model.
• Check the model’s goodness-of-fit, the statistical significance of the coefficients and 
their interpretation. 
• Exclude insignificant variables from the model one by one. Each time you exclude 
a variable, run the model again and check the significance of the remaining variables. 
• When you no longer have any insignificant variable, check if you can include any other 
variables that you expected to have an impact on your dependent variable.
• Similar to the regression modelling of this project, this process is also subjective and 
there is no single correct solution. Remember to prioritise transport related variables, 
aim for higher goodness-of-fit, and keep an eye on the significance of coefficients.
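The brief leaves the model specification open. Assuming a multinomial logit purely for illustration, the sketch below shows how category probabilities follow from utilities, and how the log-likelihood and McFadden's rho-squared (one common goodness-of-fit measure) can be computed. The utility values are hypothetical, and the null model here assigns equal probability to every category, which is one of several possible reference models.

```python
import math

def choice_probabilities(utilities):
    """Multinomial logit probabilities: exp(V_i) / sum_j exp(V_j)."""
    top = max(utilities)                       # subtract max for numerical stability
    exps = [math.exp(v - top) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(utility_rows, observed):
    """Sum of log probabilities of the observed category of each record."""
    return sum(math.log(choice_probabilities(u)[c])
               for u, c in zip(utility_rows, observed))

def mcfadden_rho_squared(ll_model, ll_null):
    """1 - LL(model) / LL(null); closer to 1 indicates a better fit."""
    return 1 - ll_model / ll_null

# Hypothetical utilities for (unsatisfied, moderate, satisfied) per person,
# and the index of the category each person actually belongs to.
utility_rows = [[0.0, 0.5, 1.5], [1.0, 0.2, 0.0], [0.0, 1.0, 0.5]]
observed = [2, 0, 1]

ll_model = log_likelihood(utility_rows, observed)
ll_null = log_likelihood([[0.0, 0.0, 0.0]] * len(observed), observed)
rho_squared = mcfadden_rho_squared(ll_model, ll_null)
print(round(ll_model, 4), round(ll_null, 4), round(rho_squared, 4))
```

Refitting after each excluded variable and tracking the change in log-likelihood shows how much fit is lost at each step.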
3.4. Examining the assumptions behind the selected model specification 
• At this stage you should verify that your model satisfies all the assumptions behind 
your model specification. First, list all the assumptions that need to be tested and 
provide a legitimate statistical test to validate the assumptions. 
• If your model does not satisfy one or a few of the assumptions, double check the list of 
independent variables. Sometimes excluding an unimportant variable fixes the issue. 
• If the problem still exists, use standard methods to rectify the problem. 
3.5. Validation.
• To validate the accuracy of the model, simulate the dependent variable for the test 
dataset and compare the results with the observed values. Calculate average share of 
each category from the model and compare it with the observed shares. Discuss the 
model prediction ability. 
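The share comparison in step 3.5 can be sketched as below: the predicted share of a category is the average of its predicted probability over the test records, which is then set against the observed share. The probabilities and observed categories are hypothetical toy values.

```python
def average_predicted_shares(probability_rows):
    """Average predicted probability of each category across records."""
    n = len(probability_rows)
    n_levels = len(probability_rows[0])
    return [sum(row[j] for row in probability_rows) / n for j in range(n_levels)]

def observed_shares(observed, n_levels):
    """Share of records observed in each category (categories coded 0..n-1)."""
    return [observed.count(j) / len(observed) for j in range(n_levels)]

# Hypothetical predicted probabilities for (unsatisfied, moderate, satisfied)
# and the observed category index of each record in the test dataset.
probability_rows = [
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
observed = [2, 1, 0, 2]

predicted = average_predicted_shares(probability_rows)
actual = observed_shares(observed, 3)
differences = [p - a for p, a in zip(predicted, actual)]
print(predicted, actual, differences)
```

Matching shares are a necessary but weak check, since a model can reproduce aggregate shares while misclassifying individual records.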
 
Deliverables 
This project is an individual project and no group submission is accepted. Students are required to 
submit one progress report and one final report. All the reports should be typed and submitted to 
Moodle as a PDF file. Late submission is accepted but 10% of the mark will be deducted for each day 
of late submission.
 