GEOM9042 Spatial Information Programming
Semester 1 2018
Assignment 2: Spatial Data Science: data input, manipulation, analysis
and presentation
Due date: Friday 25 May, 1:59pm
Preliminaries
This assessment is a group assessment, where each group can consist of up to two
students. Note that it is okay to hand in this assignment on your own; however, you will
still be marked in the same way as a group of two students.
This assignment is worth 20% of your final class mark. The assignment allows for
extensive creativity and self-exploration. Creative solutions and approaches will be
rewarded with bonus points counting towards the final results (max 3 pts).
No late assignments will be accepted.
Problem setting:
Imagine you are a spatial data analyst handling traffic accident data collected by the
road authority on a regular basis. You have to:
a) automate the data processing pipeline to reduce your workload with routine data
processing;
b) enable regular generation of insights from the data – a report containing tables,
graphs or maps;
c) generate appropriately structured data products as results of your analysis, for
decision makers.
Aim
The aim of this assignment is to increase your exposure and experience with the basic
elements of Python from assignment 1 by incorporating the use of spatial data
input/output, the use of dictionaries and advanced data structures, exploring spatial
data manipulation libraries (e.g., geopandas, or Fiona and Shapely), and visualizing
the results using a data visualization library (e.g., matplotlib, plotly, cartopy, altair or
similar). This assignment consists of three tasks:
Task 1: practice the manipulation of CSV files along with tabular data visualization
and presentation, including producing tables and charts;
Task 2: demonstrate skills in applying third-party Python libraries to manipulate
spatial data read from shapefiles and/or other spatial data formats;
Task 3: explore and experiment with some (basic to advanced) spatial analysis
concepts and tasks.
Data:
There are two sets of data as the input of this assignment. You can download both of
them as .zip files from LMS.
“ACCIDENT.zip”: This archive contains nine CSV files with road accident reports for
Victoria, provided by VicRoads. The data provided allow users to analyse Victorian
fatal and injury crash data based on time, location, conditions, crash type, road user
type, object hit, etc. The archive contains the following files extracted from the VicRoads
CrashStats database:
Note: you should IGNORE the data entries for 2017 for the following tasks, as
those data are incomplete!
“SA2_2016_AUST.zip”: contains the boundaries of the Statistical Areas 2 (as
identified by the Australian Bureau of Statistics).
Task outlines:
You should write three distinct pieces of analysis (i.e., three .py files, or a Jupyter
notebook with three sections, properly documented using markdown) for the following
three tasks respectively. These three tasks are independent. You don’t need to submit
input files along with the .py files, and you should assume all the input files are
unzipped and in the same folder as the three .py files. For the output directory, you
should create a new sub-folder, named “output”.
Task 1. CSV data extraction, presentation, and visualization
Explore all csv files and understand the data. Using well-documented function
definitions meaningfully organised in modules (imported into your notebook or
script), produce the results as per the following specifications.
Provide a main function (if handing in separate scripts, executed from the script
task1_studentno.py) that provides the following output (a report) as an html file
(output into task1_studentno.html).
The report should have a meaningful title and a subheading including your names,
contact and student number (formatted following your own preferences):
1. The average number of accidents per year is ____.
2. The second most common type of accident in all the
recorded years is ____, and the percentage of the
accidents that belong to this type is ____%.
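As a sketch of how these two fill-in values might be derived with pandas; the column names below are hypothetical stand-ins, and the real CrashStats headers may differ:

```python
import io
import pandas as pd

# Miniature stand-in for ACCIDENT.csv; the real file has many more columns.
csv_text = """ACCIDENT_NO,ACCIDENTDATE,ACCIDENT_TYPE
T2006000001,1/1/2006,Collision with vehicle
T2006000002,2/1/2006,Collision with vehicle
T2006000003,3/1/2006,Struck Pedestrian
T2006000004,4/1/2006,Struck Pedestrian
T2007000001,1/1/2007,Collision with vehicle
T2007000002,2/1/2007,Collision with a fixed object
"""
df = pd.read_csv(io.StringIO(csv_text))

# Derive the year from the accident date (VicRoads dates are day-first).
df["year"] = pd.to_datetime(df["ACCIDENTDATE"], dayfirst=True).dt.year

# 1. Average number of accidents per year.
avg_per_year = df.groupby("year").size().mean()

# 2. Second most common accident type and its share of all accidents.
counts = df["ACCIDENT_TYPE"].value_counts()
second_type = counts.index[1]
second_pct = 100 * counts.iloc[1] / counts.sum()
```

The computed values can then be interpolated into the report text (e.g., with f-strings) before writing task1_studentno.html.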
You now want to enrich your report by adding graphs and data tables to it. Write
functions to export the following two tables as output. The tables should be
inserted into the report as HTML tables:
3. Number of accidents by vehicle type (rows) by year (columns). Organise the
table so that it is sorted by the number of accidents in the first reported year.
An example of an HTML table is given below:
4. Compute and sort the top 10 LGAs (Local Government Areas) that have the
highest number of accidents in 2006 in decreasing order. Then, compute the
changes in the numbers of accidents in 2016 compared to the year 2006 for
these 10 LGAs. These differences should be computed both in absolute
numbers and as percentage change. Therefore, your table should have 5
columns (you do not need to name the columns in exactly the same way):
LGA | No. 2006 | No. 2016 | Difference | Change
You should also create functions to export the content captured by the two tables
into two csv files, “AccidentByYear.csv” and “AccidentByLGA.csv”,
respectively, holding the data from the two tables.
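One possible way to build and export table 3 with pandas, using a hypothetical miniature dataset (the real rows would come from joining the relevant ACCIDENT and VEHICLE files):

```python
import pandas as pd

# Hypothetical accident records: one row per accident, with vehicle type and year.
records = pd.DataFrame({
    "vehicle_type": ["Car", "Car", "Motor Cycle", "Car", "Bicycle"],
    "year": [2006, 2006, 2006, 2007, 2007],
})

# Cross-tabulate vehicle types (rows) against years (columns).
table = pd.crosstab(records["vehicle_type"], records["year"])

# Sort rows by the count in the first reported year, in decreasing order.
table = table.sort_values(by=table.columns[0], ascending=False)

table.to_csv("AccidentByYear.csv")  # CSV export for the deliverable
html_fragment = table.to_html()     # HTML fragment to embed in the report
```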
Finally, write functions that produce the two charts (as png images) described
below and add them to your report. The charts should be exported into a png figure
and included as an HTML image:
5. Using the Matplotlib library, produce a bar chart of accident numbers by day
of the week, in 2006 and 2016. The chart should include 7 * 2 bars. The output
should look similar to this example:
6. Use the same library to produce a line chart of the yearly change of the total
number of accidents from 2006 to 2016, for each severity category. The chart
should include four lines, one for each severity category. An example of
such a multi-line chart is given below:
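For chart 5, a grouped bar chart can be sketched with Matplotlib roughly as follows; the counts here are made-up placeholders, with the real values coming from the ACCIDENT data:

```python
import os
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
counts_2006 = [310, 295, 300, 320, 380, 260, 240]  # placeholder values
counts_2016 = [290, 280, 285, 300, 360, 250, 230]  # placeholder values

# Two bars per day: shift each series half a bar-width from the tick.
x = np.arange(len(days))
width = 0.4
fig, ax = plt.subplots()
ax.bar(x - width / 2, counts_2006, width, label="2006")
ax.bar(x + width / 2, counts_2016, width, label="2016")
ax.set_xticks(x)
ax.set_xticklabels(days)
ax.set_ylabel("Number of accidents")
ax.set_title("Accidents by day of week, 2006 vs 2016")
ax.legend()

os.makedirs("output", exist_ok=True)
fig.savefig("output/accidents_by_day.png")
```

The saved png can then be referenced from the HTML report with an `<img>` tag.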
Spatial data can be presented nicely and in more advanced ways with the help of third-
party libraries (e.g., matplotlib, plotly, seaborn, and several others), which requires a
deep understanding of the data and careful design of the presentation manner. Some
examples are shown below. You are encouraged to explore and play with these
libraries and get bonus marks for an additional visualization (max 1; leave this until
you have at least finished all the required tasks from above and below). If doing so, you are
allowed to create your own task scenarios (but with the same input data).
Task 2. GeoDataFrame, shapefile creation and modification
You now need to process the csv data into a spatial dataset (shapefile) that can be used
by other analysts in your company (typically using ArcGIS or QGIS). You need to take
the textual data provided and export them as a spatial dataset using Geopandas and
Shapely. This second task should be executed from a main function in the script
task2_studentno.py, or a clearly identified second section of a Jupyter
notebook. To begin, look at the geopandas documentation first:
http://geopandas.org/index.html
1. Create a new shapefile called "AccidentLocations.shp" that contains
point-type data with the following attributes: “AccidentNumber”,
“VehicleType”, “DayOfWeek”, “SevereAccident”. You will need to create
each feature’s geometry (explore shapely.geometry.Point) in a GeoDataFrame
using geopandas. Beware of the coordinate system assigned to the generated
dataset (use WGS84: 4326).
The first field of the dataset should correspond to the AccidentNumber attribute in
the provided csv files. An example of the second field is “Car, Car, Motor Cycle” (a
string). The third field corresponds to the day of the week when the accident
happened (Monday/Tuesday/…). The last field (SevereAccident) will be set to
the number 1 (representing “True”) if at least 3 people were involved in the
accident. Otherwise, the field will be set to 0.
2. Write a function to split the entries in the created shapefile into two shapefiles:
“SevereAccidentsWeekday.shp” and “SevereAccidentsWeekend.shp”.
The first file should include all entries with severe accidents that occurred on
weekdays (Mon to Fri). Similarly, the second file should include entries of severe
accidents that happened on weekends. This step must be done after the shapefile
has been created, not before; thus, you will need to re-read and modify the
GeoDataFrame from the previously created shapefile, instead of repeating the same
process from the last step.
3. Add a new field “SA2” to "AccidentLocations.shp". The value of the field will
be set to the name of the Statistical Area 2 the accident happened within. The dataset
“SA2_2016_AUST.zip” contains the Statistical Area 2 boundaries and should be used
here. You should use a geometric query (point in polygon; explore the shapely function
point.within(polygon)) to identify the SA2 in which the accident happened. This
function can be made more performant with the use of a spatial index. A bonus point will
be awarded if you write your code so that it creates and uses a spatial index on top of
your spatial data frames. You can include timing statements using “timeit” to explore the
relative acceleration.
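The core point-in-polygon query can be sketched with Shapely alone, using two toy rectangles standing in for the real SA2 boundaries (a naive linear scan; geopandas’ spatial index, `gdf.sindex`, is the usual way to avoid testing every polygon):

```python
from shapely.geometry import Point, Polygon

# Two hypothetical SA2 boundaries; the real ones come from SA2_2016_AUST.shp.
sa2_polygons = {
    "Area A": Polygon([(144.9, -37.85), (145.0, -37.85), (145.0, -37.75), (144.9, -37.75)]),
    "Area B": Polygon([(145.0, -37.85), (145.1, -37.85), (145.1, -37.75), (145.0, -37.75)]),
}

def find_sa2(point):
    """Return the name of the SA2 containing the point, or None."""
    for name, poly in sa2_polygons.items():
        if point.within(poly):
            return name
    return None

sa2_name = find_sa2(Point(144.96, -37.81))  # falls inside "Area A"
```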
An example of opening the SA2_2016_AUST.shp, AccidentLocations.shp,
and SevereAccidentsWeekday.shp files in QGIS is shown below. If your result
shapefiles do not look similar to the example when opened in QGIS or ArcGIS, you may
want to double-check your program.
Task 3. Spatial Data analysis
In this third task, you will explore the use of spatial analysis libraries (e.g., scipy, sklearn,
pysal, etc.) for analysing the spatial patterns of the distribution of accidents based on
your point dataset. Some example analysis tasks are given below. You can either choose
from these tasks or develop your own exploratory ones. Multiple analysis tasks
presented, as well as in-depth discussions and presentations, will receive bonus marks.
Spatial temporal visual analysis (basic)
Create a map showing the number of accidents that occurred in each statistical area. Are
there any patterns? How about comparing the results for different years/weekdays and
weekends/vehicle types, etc.? Note that all analysis, incl. presentation (either in the
Jupyter notebook, or exported as an HTML report), must be automated.
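A choropleth of per-area counts can be produced directly from a GeoDataFrame; a minimal sketch on two toy areas with made-up counts (real input would be the SA2 boundaries joined with counts derived from the point dataset):

```python
import os
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted report generation
import geopandas as gpd
from shapely.geometry import Polygon

# Two toy areas with hypothetical accident counts.
areas = gpd.GeoDataFrame(
    {
        "SA2_NAME": ["Area A", "Area B"],
        "n_accidents": [12, 3],
        "geometry": [
            Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
            Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
        ],
    },
    crs="EPSG:4326",
)

# Choropleth: colour each area by its accident count.
ax = areas.plot(column="n_accidents", cmap="OrRd", legend=True, edgecolor="black")
ax.figure.savefig("accidents_per_sa2.png")
```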
Spatial Autocorrelation calculation (advanced)
In order to analyse the spatial autocorrelation of these accidents (roughly speaking, the
relationship between accident frequencies and closeness in space), you can use the
PySAL library; see http://darribas.org/gds_scipy16/ipynb_md/04_esda.html for an
example.
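To illustrate what the Moran's I statistic computed by PySAL measures, here is a hand-rolled version in plain NumPy on a toy sequence of per-area counts with simple adjacency weights; in the assignment you would instead build contiguity weights from the SA2 polygons and use pysal directly:

```python
import numpy as np

# Toy accident counts for six areas arranged in a line: high values cluster
# at one end and low values at the other, so autocorrelation is positive.
counts = np.array([10.0, 12.0, 11.0, 2.0, 1.0, 3.0])

# Binary contiguity weights: consecutive areas are neighbours.
n = len(counts)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

# Moran's I = (n / sum(W)) * (z' W z) / (z' z), with z the mean-deviations.
z = counts - counts.mean()
moran_i = (n / W.sum()) * (z @ W @ z) / (z @ z)
```

Values well above the expectation of -1/(n-1) indicate clustering of similar values; the toy data above yield a clearly positive moran_i.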
Clustering analysis (more advanced)
Libraries including Scipy and Sklearn can be used for clustering analysis such as
DBSCAN (parameter-dependent) and Kernel Density Estimation (much les parameter-
dependent) to generate clustered point clouds or probabilistic surfaces. You can
identify spatial concentrations of acidents ( using apropriate choice of parameters for
your functions, or use the power of batch executions to identify parameters that are
meaningful).
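A minimal DBSCAN sketch with scikit-learn on synthetic coordinates, two artificial hotspots plus one isolated noise point (with real lon/lat data you would typically project to a metric CRS first, so that `eps` has units of metres):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
hotspot_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2))
hotspot_b = rng.normal(loc=(10.0, 10.0), scale=0.1, size=(30, 2))
noise = np.array([[5.0, 5.0]])  # an isolated point, should be flagged as noise
points = np.vstack([hotspot_a, hotspot_b, noise])

# eps: neighbourhood radius; min_samples: density threshold for a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# DBSCAN labels noise points -1; everything else belongs to a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```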
Functions for this task should be included in the script task3_studentno.py or a
clearly identified third part of the notebook. At least one analysis task (either from the
examples or defined by yourself) must be completed. You should visualize the results
using Python visualization libraries, include maps in your HTML report, and
include any observations you can make about trends in each visualization. Think of
possible variations of this task – as an analyst, how would you nuance your analysis, or
what improvements can you make?
Include the results of your analysis in a generated HTML report task3_
studentno.html, and discuss the results. Reflect on your reasoning. The entire
report should ideally not exceed 1.5 pages, including the figures.
The goal of the task is to offer you an opportunity to explore the spatial data analysis
process, techniques and tools. There is no single correct answer; however, your
selection of problems, configuration of parameters, and discussions (i.e., don’t over-
interpret) should still be reasonable.
Submission
You should submit a zip file, including your fully commented Python code, named
"_2.zip", as well as all the csv files, html files, figures and shapefiles
produced by your program.
Only one of the two members should submit the actual files via the LMS. The other team
member simply submits a file named _2.py, which contains the names of
the team members written as a comment. Similarly, the names of both team members
should also be included in the actual Python code as a comment.
Marking scheme
Your assignment will be marked out of 20 according to the criteria correctness,
programming style, output, and comments:
Your Python program is submitted correctly and produces well-formatted
output.
Your program performs all required computations and tasks.
Your program is efficient, short, and compact. Code that is modular, and for example
could also compute a top-twenty list for other indicators, will score higher marks.
Code that is difficult to understand or uses poor programming style will lose
marks.
Code that is poorly commented (a very rough guideline: a program has a
similar amount of comments as it has code lines) will also lose marks.
Creative solutions and approaches beyond the minimal requirements will be
rewarded with a maximum of 3 points.
Hints
Make sure you check carefully that your code works correctly and produces
sensible answers (you may want to test answers by opening your outputs using a
program that can work with spreadsheets).