FIT5196-S1-2019 assessment

FIT5196-S1-2019 assessment 3

This is an individual assessment worth 30% of your total mark for FIT5196.

Due date: 11:55 PM, Wednesday, June 12, 2019

For this assessment, you are required to write Python (Python 2/3) code to integrate several datasets into a single schema and to find and fix possible problems in the data. The inputs and outputs of this assessment are shown below.

Table 1. The input and output of the task

Inputs: vic_suburb_boundary.zip, gtfs.zip, Crimebylocation.xlsx, .csv, council.txt
Output: _solution.csv
Jupyter notebook: _ass3.ipynb

You are given multiple datasets in various formats; the task is to create housing information for Victoria, Australia. Your assessment is to perform the following tasks.

Task 1: Data Integration (60%)

In this task, you are required to integrate these datasets into one with the following schema.

Table 2. Description of the final schema

COLUMN: DESCRIPTION

ID: A unique id for the property.

Address: The property address.

Suburb (20/100): The property suburb. The suburb must be calculated using only vic_suburb_boundary.zip. Default value: “not available”

Price: The property price.

Type: The type of property.

Date: The date the property was sold.

Rooms: Number of bedrooms.

Bathroom: Number of bathrooms.

Car: Number of parking spaces of the property.

LandSize: The area of the property.

Age: The age of the property at the time of selling.

Latitude: The latitude of the property.

Longitude: The longitude of the property.

train_station_id (15/100): The closest train station to the property that has a direct trip to the Southern Cross Railway Station. A direct trip is a trip with no connections (transfers) between the origin and the destination. Default value: 0

distance_to_train_station (5/100): The direct distance from the property to the closest train station that has a direct trip to the Southern Cross Railway Station. Default value: 0

travel_min_to_CBD (20/100): The average travel time (in minutes) from the closest train station (regional/metropolitan) that has a direct trip to the “Southern Cross Railway Station”, over trips departing on weekdays (i.e. Monday-Friday) between 7:00 and 9:30 am. For example, if there are 3 direct trips departing from the closest train station to the Southern Cross Railway Station on weekdays between 7:00 and 9:30 am, taking 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3 = 7. Default value: 0

over_priced? (10/100): A boolean feature indicating whether the price of the property is higher than the median price of similar properties (with respect to the bedrooms, bathrooms, parking_space, and property_type attributes, i.e. Rooms, Bathroom, Car, and Type in this schema) in the same suburb in the year of selling. Default value: -1

crime_A_average (7/100): The average of type A crime over the three years prior to selling, in the same local government area as the property. For example, if a property is sold in 2016, then you should calculate the average of crime type A for 2013, 2014 and 2015. Default value: -1

crime_B_average (7/100): The average of type B crime over the three years prior to selling, in the same local government area as the property. For example, if a property is sold in 2016, then you should calculate the average of crime type B for 2013, 2014 and 2015. Default value: -1

crime_C_average (6/100): The average of type C crime over the three years prior to selling, in the same local government area as the property. For example, if a property is sold in 2016, then you should calculate the average of crime type C for 2013, 2014 and 2015. Default value: -1

Task 2: Data Reshaping (15%)

In this task, you need to study the effect of different normalization/transformation methods (i.e. standardization, min-max normalization, log, power, and root transformation) on the Rooms, crime_C_average, travel_min_to_CBD, and property_age (the Age column) attributes.
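The five methods named above can be sketched as follows. This is a minimal illustration, not a prescribed solution: the function names and the sample values are hypothetical, `np.log1p` is used in place of a plain logarithm as an assumption so that zeros do not produce infinities, and the exponent of 2 in the power transformation is an arbitrary choice. Log and root transformations need non-negative input, so rows holding a default value of -1 would have to be handled before applying them.

```python
import numpy as np

def standardize(x):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max(x):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def log_transform(x):
    """log(1 + x); assumes x >= 0 (log1p is an assumption to tolerate zeros)."""
    return np.log1p(np.asarray(x, dtype=float))

def power_transform(x, p=2):
    """Raise each value to the power p (p=2 is an arbitrary illustrative choice)."""
    return np.power(np.asarray(x, dtype=float), p)

def root_transform(x):
    """Square root; assumes x >= 0."""
    return np.sqrt(np.asarray(x, dtype=float))

# Made-up values standing in for one attribute, e.g. Rooms.
rooms = np.array([1, 2, 3, 4, 5])

for name, fn in [("standardize", standardize), ("min_max", min_max),
                 ("log", log_transform), ("power", power_transform),
                 ("root", root_transform)]:
    print(name, fn(rooms))
```

Comparing a scatter plot of each transformed attribute against price is one way to judge which transformation brings the relationship closest to linear.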
You need to observe and explain their effect, assuming that we want to build a linear model on price using these attributes as the predictors, and recommend which one(s) you think would work better on this data. When building the linear model, the same normalization/transformation method can be applied to each of these attributes.

Task 3: Documentation and Methodology (25%)

The documentation is assessed mainly on the quality of your explanation of how you completed these tasks. Your notebook file should be well formatted, with proper sections and subsections.

Note 1: The output csv file must have exactly the columns specified in the schema. If you decide not to calculate any of the required attributes, then you must still have a column for that attribute in your final data-frame, with the default value as the value of all the rows. Please note that an output file that is not in the correct format, as specified in the integrated schema, won’t be marked.

Note 2: The radius of the earth is still 6378 km!

Note 3: In Table 2, the numbers in the format (a/b) next to some of the rows are the marks allocated to that attribute. For example, the “suburb” attribute carries 20% of the total mark of task 1. Please note that 10% of the total marks for task 1 is allocated to any other issues that may occur during the data integration process.

Note 4: You can only use the vic_suburb_boundary.zip file to extract the suburb name of the property. Using other external datasets or packages (e.g., geopy) to directly get the suburb information will be penalized (this will result in 0 marks for the suburb attribute).

Note 5: For more info about GTFS data please visit here, here, and here.
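Note 2 fixes the earth radius at 6378 km, which is the constant needed for the distance_to_train_station attribute. The sketch below assumes that "direct distance" means great-circle distance computed with the haversine formula; the specification does not name a formula, so treat that as an assumption. The function names, the station tuples, and the coordinates are hypothetical, and the list of stations is assumed to have been filtered already to those with a direct trip to the Southern Cross Railway Station.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6378.0  # radius given in Note 2

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def closest_station(lat, lon, stations):
    """Return (station_id, distance_km) for the nearest station.

    `stations` is an iterable of (station_id, lat, lon) tuples; this layout
    is a hypothetical one chosen for the example.
    """
    return min(
        ((sid, haversine_km(lat, lon, slat, slon)) for sid, slat, slon in stations),
        key=lambda pair: pair[1],
    )

# Hypothetical station ids and coordinates, for illustration only.
stations = [(1, -37.0, 145.0), (2, -37.8, 144.95)]
station_id, distance = closest_station(-37.81, 144.96, stations)
print(station_id, round(distance, 3))
```

The returned id and distance would fill train_station_id and distance_to_train_station; a property with no qualifying station would keep the default value 0 for both.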