CVEN9407-Transport Modelling

Project Brief

Introduction

This document explains the final project of CVEN9407. The project is individual, and group submissions are not accepted. Its purpose is to familiarise students with practical econometric analysis and to guide them in drawing statistical inferences. The project is worth 50% of the final grade. Students are evaluated on their submitted progress report and their final report. This brief covers the data, the recommended software, the process of data analysis and model development, the format of the reports, and the submission dates.

Guidance and assistance

Students are advised to self-monitor their progress on the project and seek assistance if needed.

Students can gauge their performance based on their progress report feedback.

Students can use their workshop hours to discuss issues with the course demonstrator. If further

assistance is needed, students can ask for consultation with the course coordinator.

Software

A statistical software package is required to complete this project. R is the recommended package for this study. R is a free statistical package which can be downloaded from this website. To make R easier to use, it is recommended to download RStudio as well. RStudio can be downloaded from this website.

A basic introduction to the software will be provided in the lectures, and sample code for completing most of the workshop questions will be provided to students.

Note that using R is not mandatory, and students can work with other statistical software packages if they wish. The one exception is the test/train split in step 2.1, which should be completed in R.

Data

The dataset for this study is obtained from the Household, Income and Labour Dynamics in Australia (HILDA) survey. HILDA is a longitudinal survey which started in 2001 and is planned to continue until 2021 (for more information about this survey, refer to this website). HILDA contains sociodemographic information about respondents, as well as their ratings of their satisfaction in different domains. The main purpose of this study is to investigate the impact of transport-related variables on life satisfaction. HILDA is a confidential dataset, and students must submit a request to DataVerse to obtain it. A separate document will be uploaded to Moodle to guide you through getting access to HILDA.

*** The very first step of this project is obtaining access to HILDA ***

After you obtain access to HILDA, the dataset for this project will be shared with you. For confidentiality reasons, all personal information has been removed from this dataset.

Every student is expected to focus on one aspect of life satisfaction in a specific year. To obtain your personalised dataset, filter the shared dataset based on the “year” and “variable of interest” allocated to you in a separate table. This table is posted on the Moodle page.

You must keep only one of the 9 life satisfaction variables in your dataset, which will be the dependent variable of your study. Note that the other life satisfaction variables must not be used as independent variables.

The variable of interest in this table is your dependent variable. Throughout the project, you will investigate the potential impact of the other explanatory variables on this variable.

Variables definition

The definition of most of the variables is provided here. However, some of the fields in the

processed data set do not exist in the HILDA Data Dictionary. Below you can find the definition of

these variables.

The last 40 variables in this list describe the land use of individuals’ residences. Four indexes are available which describe the socio-demographic condition of zones. These indexes are generated by the Australian Bureau of Statistics and are referred to as Socio-Economic Indexes for Areas (SEIFA). SEIFA variables include:

• The Index of Relative Socio-Economic Disadvantage (IRSD)

• The Index of Relative Socio-Economic Advantage and Disadvantage (IRSAD)

• The Index of Education and Occupation (IEO)

• The Index of Economic Resources (IER).

For more information please visit this webpage.

Variable Definition

Female Binary variable indicating gender (female =1)

Married Binary variable indicating marital status (married =1)

ESL Binary variable indicating if English is the second language

Le_mar Binary variable indicating if the individual has experienced the life event of marriage last year

Le_sep Binary variable indicating if the individual has experienced the life event of separation last year

Le_job Binary variable indicating if the individual has experienced the life event of job change last year

Le_bth Binary variable indicating if the individual has experienced the life event of giving birth to a child last year

Le_prg Binary variable indicating if the individual has experienced the life event of becoming pregnant last year

Le_death Binary variable indicating if the individual has experienced the life event of death of spouse/child/close friend/relative last year

Le_fni Binary variable indicating if the individual has experienced major improvement in finances last year

Le_fnw Binary variable indicating if the individual has experienced worsening in finances last year

Le_frd Binary variable indicating if the individual has been fired or made redundant last year

Le_prm Binary variable indicating if the individual has been promoted last year

Le_rtr Binary variable indicating if the individual has retired last year

Le_ins Binary variable indicating if the individual had serious personal injuries last year

Mltpljob Binary variable indicating if the individual is employed in multiple jobs

Manager Binary variable indicating if the job type is managerial

Professional Binary variable indicating if the job type is professional

Technician Binary variable indicating if the job type is technician

ServiceWorker Binary variable indicating if the job type is service work

Administrative Binary variable indicating if the job type is administrative

SalesWorker Binary variable indicating if the job type is sales worker

MachineryOperator Binary variable indicating if the job type is machinery operator

Labour Binary variable indicating if the job type is labour

FlxWork Binary variable indicating if the individual has flexible working hours

HmWork Binary variable indicating if the individual can work from home

PrtStudy Binary variable indicating if the individual is doing part time studies

FullStudy Binary variable indicating if the individual is doing full time studies

Postgrad Binary variable indicating education level (postgraduate =1)

Bachelor Binary variable indicating education level (Bachelor=1)

CoupleWo Binary variable indicating if family structure is couple without children

CoupleW Binary variable indicating if family structure is couple with children

LoneW Binary variable indicating if family structure is single parent

Single Binary variable indicating if family structure is single person

Renter Binary variable indicating if the individual is renting his/her living place

hhad10_1 Binary variable indicating if the 'IRSAD' index of the home zone is less than 1

hhad10_2 Binary variable indicating if the 'IRSAD' index of the home zone is less than 2

hhad10_3 Binary variable indicating if the 'IRSAD' index of the home zone is less than 3

hhad10_4 Binary variable indicating if the 'IRSAD' index of the home zone is less than 4

hhad10_5 Binary variable indicating if the 'IRSAD' index of the home zone is less than 5

hhad10_6 Binary variable indicating if the 'IRSAD' index of the home zone is less than 6

hhad10_7 Binary variable indicating if the 'IRSAD' index of the home zone is less than 7

hhad10_8 Binary variable indicating if the 'IRSAD' index of the home zone is less than 8

hhad10_9 Binary variable indicating if the 'IRSAD' index of the home zone is less than 9

hhda10_1 Binary variable indicating if the 'IRSD' index of the home zone is less than 1

hhda10_2 Binary variable indicating if the 'IRSD' index of the home zone is less than 2

hhda10_3 Binary variable indicating if the 'IRSD' index of the home zone is less than 3

hhda10_4 Binary variable indicating if the 'IRSD' index of the home zone is less than 4

hhda10_5 Binary variable indicating if the 'IRSD' index of the home zone is less than 5

hhda10_6 Binary variable indicating if the 'IRSD' index of the home zone is less than 6

hhda10_7 Binary variable indicating if the 'IRSD' index of the home zone is less than 7

hhda10_8 Binary variable indicating if the 'IRSD' index of the home zone is less than 8

hhda10_9 Binary variable indicating if the 'IRSD' index of the home zone is less than 9

hhec10_1 Binary variable indicating if the 'IER' index of the home zone is less than 1

hhec10_2 Binary variable indicating if the 'IER' index of the home zone is less than 2

hhec10_3 Binary variable indicating if the 'IER' index of the home zone is less than 3

hhec10_4 Binary variable indicating if the 'IER' index of the home zone is less than 4

hhec10_5 Binary variable indicating if the 'IER' index of the home zone is less than 5

hhec10_6 Binary variable indicating if the 'IER' index of the home zone is less than 6

hhec10_7 Binary variable indicating if the 'IER' index of the home zone is less than 7

hhec10_8 Binary variable indicating if the 'IER' index of the home zone is less than 8

hhec10_9 Binary variable indicating if the 'IER' index of the home zone is less than 9

hhed10_1 Binary variable indicating if the 'IEO' index of the home zone is less than 1

hhed10_2 Binary variable indicating if the 'IEO' index of the home zone is less than 2

hhed10_3 Binary variable indicating if the 'IEO' index of the home zone is less than 3

hhed10_4 Binary variable indicating if the 'IEO' index of the home zone is less than 4

hhed10_5 Binary variable indicating if the 'IEO' index of the home zone is less than 5

hhed10_6 Binary variable indicating if the 'IEO' index of the home zone is less than 6

hhed10_7 Binary variable indicating if the 'IEO' index of the home zone is less than 7

hhed10_8 Binary variable indicating if the 'IEO' index of the home zone is less than 8

hhed10_9 Binary variable indicating if the 'IEO' index of the home zone is less than 9

Analysis

1. Data analysis

1.1. The first step is to familiarise yourself with the data. For that purpose:

• Check the definition of variables

• Check for any missing values in the data

• Calculate the mean and the standard deviations of continuous variables

• Calculate the frequencies for discrete variables

• If needed, plot the data to see the variations in variables

• Check the range of variables and see if it makes sense to you
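The checks in step 1.1 can be sketched in base R as follows. This is a minimal illustration on synthetic data, since the real HILDA extract is confidential. The column name satisfaction and all generated values are assumptions made for the example; only lscom and Female come from the variable list.

```r
# Synthetic stand-in for the project dataset (values are made up).
set.seed(1)
n <- 200
dat <- data.frame(
  satisfaction = sample(0:10, n, replace = TRUE),  # hypothetical dependent variable
  lscom        = round(rexp(n, rate = 1 / 5), 1),  # continuous variable
  Female       = rbinom(n, 1, 0.5)                 # binary indicator
)
dat$lscom[c(3, 17)] <- NA                          # inject two missing values for the demo

# Number of missing values in each column
missing_counts <- colSums(is.na(dat))
print(missing_counts)

# Mean, standard deviation and range of a continuous variable
cont_summary <- c(
  mean = mean(dat$lscom, na.rm = TRUE),
  sd   = sd(dat$lscom, na.rm = TRUE),
  min  = min(dat$lscom, na.rm = TRUE),
  max  = max(dat$lscom, na.rm = TRUE)
)
print(round(cont_summary, 2))

# Frequencies of a discrete variable
female_freq <- table(dat$Female, useNA = "ifany")
print(female_freq)
```

For the plotting step, hist(dat$lscom) or plot(dat$lscom, dat$satisfaction) gives a quick view of the variation in a variable.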

1.2. The relationship between variables

• Calculate the correlation matrix for available variables

• Highlight the strong correlations in the matrix

• Justify your observation. Explain potential reasons behind strong correlations.

• Are there cases which you expect to see strong correlations, but data shows

otherwise? Discuss these cases.

• Is there any variable that you expect to have a non-linear relationship with the

dependent variable? If you are not sure, plot the dependent variable against it and see

if you can detect any pattern.

• For variables that you suspect have a non-linear relationship with the dependent variable, define new independent variables with an appropriate transformation (logarithmic, exponential, second or third power, etc.).

• Include the new independent variables in the correlation matrix and discuss the

results.
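One possible way to carry out step 1.2 in base R is sketched below. The data is synthetic, the variables income and satisfaction are hypothetical names, and the 0.5 threshold for a "strong" correlation is an arbitrary choice made for the example rather than a requirement of this brief.

```r
set.seed(2)
n <- 300
income <- rlnorm(n, meanlog = 10, sdlog = 0.5)        # hypothetical continuous variable
dat <- data.frame(
  income  = income,
  hxymvfi = 0.02 * income + rnorm(n, sd = 200),       # made to track income in this demo
  lscom   = rexp(n, rate = 1 / 5)
)
# Dependent variable built with a logarithmic dependence on income
dat$satisfaction <- 2 + 0.8 * log(dat$income) + rnorm(n, sd = 0.5)

# Correlation matrix, using all pairwise-complete observations
cor_mat <- cor(dat, use = "pairwise.complete.obs")

# List each pair once (upper triangle) whose |r| exceeds the threshold
threshold <- 0.5
strong <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[strong[, "row"]],
  var2 = colnames(cor_mat)[strong[, "col"]],
  r    = round(cor_mat[strong], 2)
)
print(strong_pairs)

# Define a transformed variable and compare correlations with the dependent variable
dat$log_income <- log(dat$income)
print(cor(dat$satisfaction, dat$income))
print(cor(dat$satisfaction, dat$log_income))
```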

2. Regression analysis

2.1. It is always recommended to divide the dataset into test and train sub-datasets. The train

dataset, containing 80 percent of records, is used to estimate the parameters of the model

and the test dataset, containing the remaining 20 percent, is used to validate the model.

• Use the sample() function in R to randomly divide the dataset into test and train datasets. Even if you choose to use another statistical package for this project, this step should be completed using R (this is because the marker will be using R to check your analysis).

• To make the random split reproducible, set the seed number to your student ID. This way, although the data is divided randomly into test and train sub-datasets, the process can be repeated. The command in R to fix the seed number is set.seed().
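A minimal sketch of the split is given below. The student ID 1234567 and the toy data frame are placeholders; replace them with your own ID and your filtered dataset.

```r
# Synthetic stand-in for the filtered project dataset
dat <- data.frame(id = 1:500, satisfaction = rep(0:9, 50))

# Hypothetical student ID used as the seed; replace with your own
student_id <- 1234567
set.seed(student_id)

# Draw 80% of the row indices without replacement for the train dataset
n <- nrow(dat)
train_idx <- sample(n, size = floor(0.8 * n))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]   # the remaining 20% of records

print(c(train = nrow(train), test = nrow(test)))
```

Because the seed is fixed, rerunning these lines from set.seed() onwards reproduces exactly the same split.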

2.2. Selecting the set of explanatory variables to be included in the model

• The main purpose of this study is to examine the relationship between transport-related variables and the level of satisfaction. The dependent variable is the level of satisfaction, and the remaining variables form the set of independent variables.

• The available transport related variables in this study are:

o lscom: Travel time to/from paid work per week

o hxymvfi: Household annual expenditure on motor vehicle fuel ($)

o hxymvri: Household annual expenditure on motor vehicle repairs/maintenance ($)

o hxyncri: Household annual expenditure on new motor vehicles, motorbikes or other vehicles

o hxypbti: Household annual expenditure on public transport and taxis

• For each transport-related variable, run a separate regression model with only that variable and discuss the estimated coefficient.

• Run a regression model for each of the 31 combinations of transport-related variables (all non-empty subsets of the five variables) and select the best model. The best model has the highest goodness-of-fit, while all the included variables are statistically significant.

• Use the forward stepwise method to add other independent variables to the model

o Use the Bayesian Information Criterion (BIC) as the improvement criterion in the stepwise method. In each step, add one variable to the model. This variable should be statistically significant and improve BIC the most.

o Continue the process until either all the variables are exhausted or none of the

remaining variables can improve BIC any further.

• The model that you have developed so far results from a mechanical process in which theory did not play a role. At this stage you should examine the model to see if it is consistent with existing theories in the field. There are two issues to take into consideration. First, exploring the theories on life satisfaction is outside the scope of this subject. So, as a simplification, we rely only on common sense (note that in a real project the reference must be accepted theories). Second, from this point the process becomes somewhat subjective. In previous steps, BIC and adjusted R-squared could help you select the best model and modify it. From this point, however, you need to use your judgment to decide how much goodness-of-fit can be compromised to include or exclude variables based on your expectations (or theories). Different modellers have different judgments and different approaches to implementing their opinions. So, get ready to grow your own modelling judgment.

o Justify the included variables and the signs of their coefficients. Are there any variables that you cannot justify, or whose sign is counterintuitive?

o On the other hand, is there any of the remaining variables which you expected to

be included in your model?

o Improve your model by putting aside unreasonable variables and including new variables from the leftovers that you expected to be included. Most likely, this practice deteriorates the model's goodness-of-fit. This is where you should decide how much goodness-of-fit you are willing to compromise to improve justifiability.

o Note that in this study you want to investigate the relationship between

transport related variables and level of satisfaction. So transport related variables

should have a higher priority to be included in the model.
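The mechanical part of step 2.2 can be sketched as below, under several assumptions: the data is synthetic, satisfaction is a hypothetical name for the dependent variable, adjusted R-squared is used as the goodness-of-fit measure, and 0.05 is used as the significance level. The base R function step() uses BIC when k = log(n), but it does not check the significance of each added variable, so that condition still has to be checked by hand.

```r
set.seed(4)
n <- 400
transport <- c("lscom", "hxymvfi", "hxymvri", "hxyncri", "hxypbti")
others    <- c("Female", "Married", "Renter")

# Synthetic data: standardised stand-ins for the transport variables.
# By construction, satisfaction depends on lscom, hxypbti and Married only.
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, transport)))
for (v in others) dat[[v]] <- rbinom(n, 1, 0.5)
dat$satisfaction <- 7 - 0.6 * dat$lscom + 0.4 * dat$hxypbti + 0.8 * dat$Married + rnorm(n)

# All non-empty subsets of the five transport variables: 2^5 - 1 = 31
subsets <- unlist(
  lapply(seq_along(transport), function(k) combn(transport, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and record fit statistics
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit   <- lm(reformulate(vars, response = "satisfaction"), data = dat)
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
  data.frame(
    model   = paste(vars, collapse = " + "),
    adj_r2  = summary(fit)$adj.r.squared,
    bic     = BIC(fit),
    all_sig = all(pvals < 0.05)
  )
}))

# Best model: highest adjusted R-squared among models with all variables significant
eligible_idx <- which(results$all_sig)
best_idx     <- eligible_idx[which.max(results$adj_r2[eligible_idx])]
best_vars    <- subsets[[best_idx]]
print(results[best_idx, ])

# Forward stepwise search over the other variables, with BIC as the criterion
base_fit <- lm(reformulate(best_vars, response = "satisfaction"), data = dat)
upper    <- reformulate(c(best_vars, others))
step_fit <- step(base_fit, scope = list(upper = upper),
                 direction = "forward", k = log(n), trace = 0)
print(formula(step_fit))
```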

2.3. Testing the assumptions of the Classical Linear Regression Model (CLRM)

• List all the assumptions behind CLRM and the statistical test that you prefer to use to

validate the assumptions.

• Test your model to see if it satisfies all the assumptions.

• If your model does not satisfy one, or some of the assumptions, double check the set

of your independent variables. Sometimes excluding unnecessary variables solves the

issue.

• If the problem still exists, use standard methods to rectify the problem.
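As an illustration only, three common diagnostics can be computed in base R as sketched below. The choice of tests is an assumption made for this example (Shapiro-Wilk for normality of residuals, a hand-computed Breusch-Pagan test in Koenker's studentised form for homoskedasticity, and variance inflation factors for multicollinearity); it is not a complete list of the CLRM assumptions, and you may prefer other tests. The data and the name satisfaction are synthetic.

```r
set.seed(5)
n <- 300
dat <- data.frame(
  lscom   = rexp(n, rate = 1 / 5),
  hxypbti = rexp(n, rate = 1 / 300),
  Married = rbinom(n, 1, 0.5)
)
dat$satisfaction <- 7 - 0.1 * dat$lscom + 0.001 * dat$hxypbti + 0.5 * dat$Married + rnorm(n)

predictors <- c("lscom", "hxypbti", "Married")
fit <- lm(reformulate(predictors, response = "satisfaction"), data = dat)

# Normality of residuals: Shapiro-Wilk test (accepts 3 to 5000 observations)
normality_p <- shapiro.test(residuals(fit))$p.value

# Homoskedasticity: regress the squared residuals on the regressors.
# Under the null hypothesis, n * R^2 follows a chi-squared distribution
# with as many degrees of freedom as there are regressors.
dat$res_sq <- residuals(fit)^2
aux     <- lm(reformulate(predictors, response = "res_sq"), data = dat)
lm_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = length(predictors), lower.tail = FALSE)

# Multicollinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the remaining predictors
vif <- sapply(predictors, function(v) {
  r2 <- summary(lm(reformulate(setdiff(predictors, v), response = v), data = dat))$r.squared
  1 / (1 - r2)
})

print(c(shapiro_p = normality_p, breusch_pagan_p = bp_p))
print(round(vif, 2))
```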

2.4. Validation.

• To validate the accuracy of the model, simulate the dependent variable for the test

dataset and compare the results with the observed values. Discuss the model

prediction ability.
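A minimal sketch of this validation step follows, on synthetic data with a hypothetical dependent variable named satisfaction. The error measures shown (RMSE, MAE and the squared correlation between observed and predicted values) are one reasonable choice, not a prescribed set.

```r
set.seed(6)
n <- 500
dat <- data.frame(lscom = rexp(n, rate = 1 / 5), Married = rbinom(n, 1, 0.5))
dat$satisfaction <- 7 - 0.1 * dat$lscom + 0.5 * dat$Married + rnorm(n)

# 80/20 split, as in step 2.1
train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Estimate on the train dataset, predict for the test dataset
fit  <- lm(satisfaction ~ lscom + Married, data = train)
pred <- predict(fit, newdata = test)

# Compare predictions with the observed values
errors <- test$satisfaction - pred
validation <- c(
  rmse = sqrt(mean(errors^2)),              # root mean squared error
  mae  = mean(abs(errors)),                 # mean absolute error
  r2   = cor(test$satisfaction, pred)^2     # squared correlation, observed vs predicted
)
print(round(validation, 3))
```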

2.5. Regarding your report: as you can see, there is a long process behind developing a regression model. However, you do not need to report all the work you have done. Think about what would be interesting for readers to learn from your endeavour and how to convey the highlights of your study efficiently. For instance, you can provide a plot of the BIC variations in step 2.2, which summarises the stepwise process. Your report should include the final model, which satisfies all the CLRM assumptions, and your justification for the coefficients and their signs.

3. Discrete choice analysis

3.1. Selecting the right model specification

• The first step in developing a discrete choice model is deciding about the model

specification. The initial decision on model specification is mainly based on the

dependent variable. Note that this decision might change along the way.

3.2. Defining choices and setting up the utility functions

• Discrete choice models, as the name implies, are developed to model the outcome of selecting one option out of multiple available alternatives. The output of a discrete choice model is the probability of selecting each of the alternatives. In this study, we extend the application of discrete choice models to the probability of belonging to a category, rather than selecting a category. In our study, people do not make a decision about their level of satisfaction; rather, they feel that they belong to a certain category. Although this does not resemble a choice setup, by modifying our definition of the utility function we can still use discrete choice models in this context.

• To simplify the model, aggregate the range of your dependent variable into three categories: unsatisfied, moderate, and satisfied. The dependent variable varies from 0 to 10. Assume values below 5 indicate dissatisfaction and values above 7 indicate complete satisfaction. Based on this assumption, define a new dependent variable with three levels. Then calculate the “market share” of each of the categories for the test dataset, the train dataset and overall.

• Based on the nature of available independent variables, discuss your alternative

specific variables and generic variables, then derive a mathematical formulation for the

utility functions.
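The aggregation into three categories and the market shares can be sketched in base R as below. The data is synthetic, and the names satisfaction and sat_cat are hypothetical. The breaks assume an integer scale, so values 0 to 4 are unsatisfied, 5 to 7 are moderate and 8 to 10 are satisfied.

```r
set.seed(7)
n <- 500
# Synthetic 0-10 ratings; the weights are arbitrary and are normalised by sample()
dat <- data.frame(
  satisfaction = sample(0:10, n, replace = TRUE,
                        prob = c(1, 1, 1, 2, 3, 5, 8, 14, 20, 12, 6))
)

# Intervals are (-Inf, 4], (4, 7] and (7, Inf)
dat$sat_cat <- cut(dat$satisfaction,
                   breaks = c(-Inf, 4, 7, Inf),
                   labels = c("unsatisfied", "moderate", "satisfied"),
                   ordered_result = TRUE)

# 80/20 split, as in step 2.1
train_idx <- sample(n, size = floor(0.8 * n))

# "Market share" of each category in the train dataset, test dataset and overall
shares <- rbind(
  train   = prop.table(table(dat$sat_cat[train_idx])),
  test    = prop.table(table(dat$sat_cat[-train_idx])),
  overall = prop.table(table(dat$sat_cat))
)
print(round(shares, 3))
```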

3.3. Estimating the parameters of the model

• According to the selected model specification and the defined utility function, run a discrete choice model with the same set of independent variables that you settled on in your regression model.

• Check the model’s goodness-of-fit, and the statistical significance and interpretation of the coefficients.

• Exclude insignificant variables from the model one by one. Each time you exclude a variable, run the model again and check the significance of the remaining variables.

• When you no longer have any insignificant variable, check if you can include any other

variables that you expected to have an impact on your dependent variable.

• Similar to the regression modelling of this project, this process is also subjective and

there is no single correct solution. Remember to prioritise transport related variables,

aim for higher goodness-of-fit, and keep an eye on the significance of coefficients.

3.4. Examining the assumptions behind the selected model specification

• At this stage you should verify that your model satisfies all the assumptions behind

your model specification. First, list all the assumptions that need to be tested and

provide a legitimate statistical test to validate the assumptions.

• If your model does not satisfy one or a few of the assumptions double check the list of

independent variables. Sometimes excluding an unimportant variable fixes the issue.

• If the problem still exists, use standard methods to rectify the problem.

3.5. Validation.

• To validate the accuracy of the model, simulate the dependent variable for the test

dataset and compare the results with the observed values. Calculate average share of

each category from the model and compare it with the observed shares. Discuss the

model prediction ability.
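The comparison of predicted and observed shares can be sketched as follows. The ordered logit fitted with polr() from the MASS package is only one possible specification, chosen here as an assumption for illustration; your own choice in step 3.1 may differ. The data, the cut points of the latent variable and the name sat_cat are all synthetic.

```r
library(MASS)  # provides polr(); MASS ships with standard R installations

set.seed(8)
n <- 1000
dat <- data.frame(lscom = rexp(n, rate = 1 / 5), Married = rbinom(n, 1, 0.5))

# Synthetic three-level outcome generated from a latent variable with logistic noise
latent <- -0.15 * dat$lscom + 0.8 * dat$Married + rlogis(n)
dat$sat_cat <- cut(latent, breaks = c(-Inf, -1.5, 0.5, Inf),
                   labels = c("unsatisfied", "moderate", "satisfied"),
                   ordered_result = TRUE)

# 80/20 split, as in step 2.1
train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Estimate on the train dataset
fit <- polr(sat_cat ~ lscom + Married, data = train, method = "logistic")

# Predicted probability of each category for every test record
probs <- predict(fit, newdata = test, type = "probs")

# Average predicted share versus observed share in the test dataset
comparison <- rbind(
  predicted = colMeans(probs),
  observed  = as.numeric(prop.table(table(test$sat_cat)))
)
print(round(comparison, 3))

# Share of test records whose most likely category matches the observed one
pred_class <- predict(fit, newdata = test, type = "class")
accuracy   <- mean(as.character(pred_class) == as.character(test$sat_cat))
print(round(accuracy, 3))
```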

Deliverables

This project is an individual project and no group submission is accepted. Students are required to

submit one progress report and one final report. All the reports should be typed and submitted to

Moodle as a PDF file. Late submission is accepted but 10% of the mark will be deducted for each day

of late submission.
