groupname_assignment_b
Note that this is a group assignment - you are required to work in a group of 2-4 students for the
whole assignment. One set of peer evaluation forms (submitted via blackboard) is required for
Assignment B.
1.1 Background
You have been employed by a company that sells apps and devices to help drivers reduce their
risk of infringing on road rules (and getting caught!). The marketing department has come up
with a two-pronged campaign that it wants to target to different demographics.
They plan to market the financial implications of infringements to university students, through
an education campaign about the type and cost of infringements that are most likely to occur.
They plan to market the safety aspects of their products to young families, focussing on situations
where child safety is at risk.
Your job is to help support these campaigns: first, to establish the market for them, and secondly
to provide information that will be used in the education pieces of the campaigns.
NB: The data set used in this assignment is both real and very recent (from the NSW
Office of State Revenue, see [http://data.gov.au/] for all open government data sets, or
[http://www.revenue.nsw.gov.au/info/statistics] for this particular one - the "Penalty Notice
Data Set"). That means you may be the first person in the world to uncover an error, quirky
fact, or meaningful result. Good luck!
1.2 Submission Instructions
1. Each group needs to submit a single Jupyter notebook (.ipynb file) which contains all of their
code and analysis, via the link on Blackboard (Assessment, Assignment B Submission).
2. The provided material is a zip file containing a template notebook (this document:
groupname_assignment_b.ipynb, also as a pdf), two data files (penalty_data_train.csv
and penalty_data_test.csv), and an Excel spreadsheet describing the data set
(penalty_data_description.xlsx).
3. Complete the template notebook with your code. You may make extra cells as you prefer,
but please leave the question cells there for ease of reading.
4. The notebook will be run using the menu "Cell->Run All" (using the latest Python 3 based
Anaconda Python installation available on the date the assignment is posted), with the
penalty_data_set.csv file in the same folder as the notebook.
5. All of your outputs (.csv files) need to be written to that same directory, with the filename
and format as requested by the question.
6. The correctness of produced .csv files will be assessed automatically (by a Python script),
so specifications must be followed precisely. The most important thing is to have the (exact)
correct column names and row ordering. The bold numbers (index) of the data frame will
be ignored, so don’t worry about them.
7. Use Markdown Cells for longer explanation of your work and analysis, as required by some
of the questions.
8. A short assessment of the content of the notebook will be made (for code style, clarity of
explanation, and validity of your approach).
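As a rough illustration of items 5 and 6, the sketch below writes a small data frame to a csv and reads it back to check the column names and row order. The frame contents, the `example_{}.csv` file name and the temporary folder are made up for illustration; only `GROUP_NAME` and the `to_csv(...format(GROUP_NAME))` pattern come from the template.

```python
import os
import tempfile

import pandas as pd

GROUP_NAME = "my_group_name"  # placeholder value, as in the template

# Hypothetical toy result; a real answer would be built from the penalty data.
df_out = pd.DataFrame({
    "OFFENCE_CODE": [2, 1],
    "TOTAL_VALUE": [300, 100],
})

# Written to a temporary folder here; the assignment asks for the notebook's own folder.
out_path = os.path.join(tempfile.mkdtemp(), "example_{}.csv".format(GROUP_NAME))
df_out.to_csv(out_path)

# Read the file back, treating the first column as the (ignored) index.
df_check = pd.read_csv(out_path, index_col=0)
columns_ok = list(df_check.columns) == ["OFFENCE_CODE", "TOTAL_VALUE"]
order_ok = df_check["OFFENCE_CODE"].tolist() == [2, 1]
```

Reading each output back in this way is a cheap check that the column names and row order match what the question asks for.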
1.3 Marking Criteria
1. Correctness of results as per the given training / validation split.
2. Correctness of results on a different random training / validation split (to be determined
by the marker after the assignment is handed in). This means that excessively tuning your
results for the exact training/test data is not a good idea.
3. Clear, well commented code (using the "#" symbol to add comments to explain your thinking).
This is particularly important when a result is incorrect, as you may still be able to get
partial marks for your answer.
4. Specific marking criteria as described in the questions below.
1.4 Suggested Resources
While posting the questions online is strictly forbidden by the University’s academic honesty policy,
you may find help in a variety of ways:
• You should be able to do the whole assignment with the following packages, which have
very helpful documentation on their websites:
• pandas: http://pandas.pydata.org/ (e.g. import pandas as pd)
• scikit-learn: http://scikit-learn.org/stable/index.html (e.g. from sklearn.tree import
DecisionTreeClassifier)
• There are many helpful online forums where python developers and data scientists discuss
the best ways of solving particular problems. http://stackoverflow.com is the biggest, and
will likely appear in any googling you do.
• If you still feel stuck with the basics, there are many free online resources to help you get up
and running with the basics, e.g. http://datacamp.com, and inexpensive e-books such as
those on O’Reilly.
1.5 Errors
If you believe there are any errors with the assignment please email the lecturer immediately at
michael.bewley@sydney.edu.au.
1.6 Setup
The code below reads the data file, creates a training and test data set, and displays the first five
rows for you. Add code in the cells below (make more cells if you like) to answer the questions.
DO NOT EDIT (except for adding in your group name)
In [ ]: %pylab notebook
import pandas as pd
# EDIT HERE: Replace this bit with a unique name for your group
GROUP_NAME = "my_group_name"
df_train = pd.read_csv('penalty_data_train.csv', parse_dates=['OFFENCE_MONTH'])
df_test = pd.read_csv('penalty_data_test.csv', parse_dates=['OFFENCE_MONTH'])
1.7 Question 1 (2 marks)
NB: Start with df_train
Initial exploratory analysis: Let’s find out the infringements that bring in the most revenue.
List all of the offence codes that brought in at least $1 million (aggregated throughout the entire
duration of the data set for each offence code). List them in a data frame with the description, the
number of occurrences, and the total revenue brought in by that offence. Order from highest to
lowest total revenue.
The format for the data frame "df_top_offences" before saving to csv should be:
OFFENCE_CODE OFFENCE_DESC TOTAL_NUMBER TOTAL_VALUE
79053 Use unregistered registrable Class A motor veh... 185602 116218644
6963 Disobey no stopping sign 456009 103148508
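The aggregation described above can be sketched on toy data as follows. This is a minimal illustration, not the reference solution: the rows are invented, and it assumes each offence code maps to a single description and that the per-row TOTAL_NUMBER and TOTAL_VALUE columns can simply be summed.

```python
import pandas as pd

# Toy stand-in for df_train; all values are invented for illustration.
df_toy = pd.DataFrame({
    "OFFENCE_CODE": [1, 1, 2, 3, 3],
    "OFFENCE_DESC": ["Offence A", "Offence A", "Offence B", "Offence C", "Offence C"],
    "TOTAL_NUMBER": [10, 5, 2, 7, 1],
    "TOTAL_VALUE": [600000, 500000, 900000, 2000000, 400000],
})

THRESHOLD = 1000000  # "at least $1 million"

# Sum occurrences and revenue per offence code (and its description).
df_top = df_toy.groupby(
    ["OFFENCE_CODE", "OFFENCE_DESC"], as_index=False
)[["TOTAL_NUMBER", "TOTAL_VALUE"]].sum()

# Keep offences at or above the threshold, highest revenue first.
df_top = df_top[df_top["TOTAL_VALUE"] >= THRESHOLD]
df_top = df_top.sort_values("TOTAL_VALUE", ascending=False).reset_index(drop=True)
```

On the toy data, codes 3 and 1 survive the filter (in that order) and code 2 is dropped. If a code appeared with more than one description in the real data, grouping on both columns would split it, so that assumption is worth checking first.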
1.7.1 Marking Guide
• 1 mark - Partially correct solution (fails automatic verification, but passes some manual
inspection of code and results).
• 2 marks - Passes automatic verification for correct results
In [ ]: # Q1 HERE
df_top_offences =
#...
df_top_offences.to_csv('q1_{}.csv'.format(GROUP_NAME))
1.8 Question 2 (3 marks)
NB: Start with df_top_offences
The marketing team wants to do a campaign about infringements relating to red lights. Take
the data frame "df_top_offences" from Question 1, and restrict it to only those entries that mention
the colour "red" (careful!). Save it as a csv in the same format as per Question 1.
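One possible reading of the "careful!" warning is that a plain substring search for "red" also matches longer words. The sketch below contrasts a naive search with a whole-word regular expression on a few illustrative strings (not taken verbatim from the data set); whether whole-word matching is the intended definition is an assumption.

```python
import pandas as pd

# Illustrative descriptions only; the real OFFENCE_DESC values may differ.
descriptions = pd.Series([
    "Proceed through red traffic light",
    "Use unregistered vehicle",
    "Not stop at red arrow",
    "Disobey no stopping sign",
])

# Naive substring search: also matches "unregistered".
naive = descriptions.str.contains("red", case=False, regex=False)

# Whole-word search: \b marks a word boundary on each side of "red".
whole_word = descriptions.str.contains(r"\bred\b", case=False, regex=True)
```

Here `naive` flags the first three strings, while `whole_word` flags only the two that mention the colour.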
1.8.1 Marking Guide
• 1 mark - Fails automatic verification, but solution has some correct aspects.
• 2 marks - Fails automatic verification with minor errors, e.g. text search not quite accurate,
or wrong order.
• 3 marks - Passes automatic verification for correct results
In [ ]: # Q2 HERE
df_top_offences_red =
#...
df_top_offences_red.to_csv('q2_{}.csv'.format(GROUP_NAME))
1.9 Question 3 (5 marks)
NB: Start with df_train
The marketing team now wants to understand the magnitude of infringements that relate to
child safety, for use with their "young families" customer segment.
Take the original data frame ("df_train") and find any offence (regardless of number of occurrences)
that relates to children (or school zones), based on the text. You’ll have to come up with your own
definition for what this means - please explain it in a comment. Add a new boolean column called
CHILD_RELATED that is True when the OFFENCE_DESC matches your search, and False when it does
not. Leave rows in the same order as the df_train that you read in at the start.
Save data in csv of the following format:
CHILD_RELATED OFFENCE_DESC
False Proceed through red traffic light - Camera Det...
False Stop on/near marked foot crossing
False Enter restricted area without offering ticket ...
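A minimal sketch of adding such a boolean column is shown below. The keyword list ("child", "children", "school") is only one hypothetical definition of "child related", and the descriptions are illustrative; the assignment leaves the actual definition to each group.

```python
import pandas as pd

# Illustrative descriptions; not taken verbatim from the data set.
df_toy = pd.DataFrame({
    "OFFENCE_DESC": [
        "Proceed through red traffic light",
        "Exceed speed limit in school zone",
        "Stop on/near marked foot crossing",
        "Drive with unrestrained child passenger",
    ]
})

# Hypothetical definition: whole-word match on a few keywords, case-insensitive.
CHILD_PATTERN = r"\b(?:child|children|school)\b"

# na=False treats missing descriptions as "not child related".
df_toy["CHILD_RELATED"] = df_toy["OFFENCE_DESC"].str.contains(
    CHILD_PATTERN, case=False, regex=True, na=False
)

# Column order as in the requested format; row order is left untouched.
df_example = df_toy[["CHILD_RELATED", "OFFENCE_DESC"]]
```

Because the column is computed row by row and nothing is sorted or filtered, the row order of the input frame is preserved.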
1.9.1 Marking Guide
• 1-2 marks - Solution incorrect, but some correct aspects.
• 3-4 marks - Some minor errors with the solution.
• 5 marks - Passes automatic verification for correct results (some leniency given to differing
interpretations of "child related").
In [ ]: # Q3 HERE
df_child_related =
#...
df_child_related.to_csv('q3_{}.csv'.format(GROUP_NAME))
1.10 Question 4 (10 marks)
Imagine the Office of State Revenue has just announced some changes that will be made to the data
set in future (hey, you’re lucky they bothered to announce it!).
1. They want to "simplify" the data by removing precise details of the infringements:
• The OFFENCE_CODE and OFFENCE_DESC columns will no longer be given in future.
• The FACE_VALUE and TOTAL_NUMBER of infringement columns will be removed (but the
TOTAL_VALUE column will stay).
2. The SCHOOL_ZONE_IND column will no longer be available in future.
Your marketing team panics that this data set, which is core to their "child related" strategy, is
about to become useless for ongoing campaigns. You assure them that you can build a predictive
model which can make a reasonable guess whether a line entry in the new data set is about a child
related offence, based on the remaining columns that will be left in the data.
Build a model that predicts whether a line represents a CHILD_RELATED infringement, as defined
previously, using the remaining variables in df_train. Hint: Using dates in prediction is
probably unwise.
Write the predictions for the test data set to a csv file in the following format, preserving the
same row order as df_test, where CHILD_RELATED is the same as in your answer to Question 3,
and CHILD_RELATED_PREDICTION is the binary (True/False) output of your predictive model for
each row:
CHILD_RELATED CHILD_RELATED_PREDICTION
False False
False False
False ...
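The overall workflow (encode the inputs, tune with cross validation on the training data, apply the same transformations to the test data, then write a two-column result) might look roughly like the sketch below. It runs on synthetic data: the CATEGORY column and the target rule are invented, DecisionTreeClassifier is used only because it is named under Suggested Resources, and balanced accuracy is just one of many possible binary metrics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 200

# Synthetic stand-in for the remaining columns; CATEGORY is hypothetical.
X_toy = pd.DataFrame({
    "TOTAL_VALUE": rng.randint(50, 2000, size=n),
    "CATEGORY": rng.choice(["a", "b", "c"], size=n),
})
# Invented target, loosely tied to CATEGORY so there is something to learn.
y_toy = (X_toy["CATEGORY"] == "a") & (rng.rand(n) > 0.1)

X_train_toy, X_test_toy = X_toy.iloc[:150], X_toy.iloc[150:]
y_train_toy, y_test_toy = y_toy.iloc[:150], y_toy.iloc[150:]

# One-hot encode categoricals, then align the test columns to the training columns.
X_train_enc = pd.get_dummies(X_train_toy, dtype=int)
X_test_enc = pd.get_dummies(X_test_toy, dtype=int).reindex(
    columns=X_train_enc.columns, fill_value=0
)

# Tune one hyperparameter (tree depth) with 5-fold cross validation on the training set.
scores_by_depth = {}
for depth in (1, 3, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(
        model, X_train_enc, y_train_toy, cv=5, scoring="balanced_accuracy"
    )
    scores_by_depth[depth] = scores.mean()

best_depth = max(scores_by_depth, key=scores_by_depth.get)
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train_enc, y_train_toy)

# Two-column output in the same row order as the test set.
df_pred = pd.DataFrame({
    "CHILD_RELATED": y_test_toy.to_numpy(),
    "CHILD_RELATED_PREDICTION": final_model.predict(X_test_enc).astype(bool),
})
```

Selecting the depth on cross-validated scores rather than on the test set is in line with the marking criteria, which warn against tuning to the exact training/test split.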
1.10.1 Marking Guide
• 1-4 marks - Code exhibits some aspects of a correct model build, but either no scores are
produced, or the model is no better than random guessing.
• 5 marks - Model achieves fair (better than random) performance on the provided test set
• 6-8 marks - Model achieves fair to good performance on a different random split of the
training/test data.
• 9-10 marks - Model achieves good to outstanding performance on an undisclosed test
method.
NB: Questions about what counts as "good" model performance will not be answered, other than
the generic "<50% means you’ve done something wrong, 50% is the same as a random guess,
and 100% is a perfect model". We are simulating a "real world" model build, where you are not
provided with a definition of "good enough" prior to building it! A range of binary performance
metrics will be used in the assessment.
In [ ]: y_train = df_child_related.CHILD_RELATED
X_train = df_train.drop([
    'SCHOOL_ZONE_IND', 'OFFENCE_CODE', 'OFFENCE_DESC', 'FACE_VALUE', 'TOTAL_NUMBER'],
    axis=1)
# Do whatever you like with the rest of the X variables.
# Build a classifier
# Hint: Use cross validation on df_train to tune any hyperparameters.
# Do the same manipulations to df_test as you did to df_train (selecting input variables etc)
# and produce predictions on that data set.
y_pred.to_csv('q4_{}.csv'.format(GROUP_NAME))
1.11 Question 5 (10 marks)
One of the most important aspects of data science is serendipitous discovery. If you’re lucky, you
may have been asked by a business to do one fairly straightforward analysis, but you discover
something else important along the way. More commonly, you will be provided with some data
and a vague business goal, and expected to come up with something insightful that impacts the
business. This is your job for Question 5 - still pretending you’re working for the same company
described above, perform an unsupervised learning analysis, and write a report (in markdown
cells below) documenting what you find. You must use either PCA or k-means, do some visualisation
of results, and explain what you see.
Pretend companies aside, this is a real, up-to-date open government data set. If you find out
something important that relates to the real world, you’re playing for more than just uni marks.
Maybe you’ll alert the government to policy error or fraud. You could discover something juicy
that’s of interest to the Australian media. Perhaps you’ll even find a business opportunity and
make some money! If you find out something important that isn’t part of the "core business" of
the pretend company - don’t worry. Great analysis and insight are the best way to get marks.
• 1-4 marks - Some attempt at analysis is made (a few graphs, and a bit of explanation), but
there is neither correct use of PCA nor k-means, and nothing particularly insightful.
• 5 marks - Correctly applies PCA or k-means, and comes up with an interesting insight.
• 6-8 marks - Clear explanations, good visualisations, correct use of PCA or k-means, and a
meaningful insight.
• 9-10 marks - Gets all the basics above, but comes up with a genuinely compelling insight.
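For orientation only, the mechanics of PCA and k-means in scikit-learn can be sketched on synthetic data as below. The data are random blobs, not the penalty data set, and the choices of two components and three clusters are arbitrary assumptions; the sketch shows both techniques although the question asks for either one. Plotting is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in: three blobs of 50 points in five dimensions.
centres = rng.normal(scale=5.0, size=(3, 5))
X_toy = np.vstack([c + rng.normal(size=(50, 5)) for c in centres])

# Standardise so that no single feature dominates through its scale.
X_scaled = StandardScaler().fit_transform(X_toy)

# PCA: project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_

# k-means: assign each row to one of three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
```

A scatter plot of the two columns of `X_2d`, coloured by `labels`, is one common way to visualise the result, and `explained` indicates how much of the variance that two-dimensional view retains.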
2 5. Unsupervised Learning Report
Put your answers here! Markdown cells support all sorts of formatting (not quite as flexible as Word, but
enough to write a good report).