(5 Marks)
You have been provided with CSV files containing data about real estate sales in Melbourne. Unfortunately, the data is 'noisy': some people have made data entry mistakes, or intentionally entered incorrect data. Your first task is to clean up the noisy data for later analysis.
There are a few particular errors in this data:
Typos have occurred in the Price column, resulting in some non-numeric values.
People have entered Regionname values that are no longer current, valid region names. The valid regions are listed in a list called VALID_REGION_NAMES, which is given to you.
Some people have formatted the sale date incorrectly, such that it is either not of the form dd/mm/yyyy or contains invalid dates, such as 31/11/2016. If a date does not fall under the year 2016 or the year 2017, it is also considered invalid.
Some Postcode values are not valid Melbourne postcode values: they need to be 4-digit integers that start with 3.
Write a function clean_sales_data(data) which takes one argument, a dictionary of data in the format returned by read_data. This data has been read directly from a CSV file, and is noisy. Your function should construct and return a new data dictionary which is identical to the input dictionary, except that invalid data, as described by the rules above, have been replaced with None. In doing so, your program should aim to not modify the input argument dictionary, data.
There is also one further type of potential error in the noisy data that needs to be fixed. Valid values in the CouncilArea column must be in the set of Melbourne council areas: Banyule, Brimbank, Darebin, Hume, Knox, Maribyrnong, Melbourne, Moonee Valley, Moreland, Whittlesea and Yarra. This list of valid councils can be found in the list VALID_COUNCIL_AREAS, which is given to you.
However, values in the CouncilArea column could be incorrectly spelt or contain extra characters. Examples of such incorrect values are: Morelnd; Yara; nox; aDarebin; Melbun; Honey Valley.
Your program should attempt to replace incorrect values with their correct council area value by using the following set similarity measure, which we use to provide a number representing the similarity between two strings. Suppose we have the following two example strings:
string1 = "Aaa bBb ccC" # The set representation of string1 is {'a', 'b', 'c', ' '}
string2 = "bbb ccc ddd" # The set representation of string2 is {'b', 'c', 'd', ' '}
Notice that for our purposes/definition case does not matter (e.g. 'A' is the same as 'a') and that space is also a character.
The set similarity (Sim) measure for string1 and string2 is given by the formula:
Sim(string1,string2)=|S1∩S2||S1∪S2|Where ∩ is set intersection, ∪ is set union, and |X| is the length of set X
So, for example:
Sim(string1,string2)=|{a,b,c,(space)}∩{b,c,d,(space)}||{a,b,c,(space)}∪{b,c,d,(space)}|
=|{b,c,(space)}||{a,b,c,d,(space)}|=35=0.6
Now, when your program comes across an incorrect council area value it should compare that incorrect value with all of the correct council area strings using the Sim function, and then replace the incorrect string with the valid council area value that it has the highest Sim measure with.
For example, when the incorrect value is "Melbun", the measure comparisons are:
Sim("Banyule", "Melbun") = 0.625
Sim("Brimbank", "Melbun") = 0.3
Sim("Darebin", "Melbun") = 0.3
Sim("Hume", "Melbun") = 0.42857142857142855
Sim("Knox", "Melbun") = 0.1111111111111111
Sim("Maribyrnong", "Melbun") = 0.25
Sim("Melbourne", "Melbun") = 0.75
Sim("Moonee Valley", "Melbun") = 0.36363636363636365
Sim("Moreland", "Melbun") = 0.4
Sim("Whittlesea", "Melbun") = 0.16666666666666666
Sim("Yarra", "Melbun") = 0.0
Since Sim("Melbourne", "Melbun") is the highest of all these values, the incorrect "Melbun" would be replaced by "Melbourne".
In cases where an incorrect string has the same similarity score with two or more correct council area values, since the program can't decide on which one might be correct, it just replaces the incorrect value with None. For example, when the incorrect council area string value is "Meorln", the two highest similarities with correct council area values are:
Sim("Moreland", "Meorln") = Sim("Melbourne", "Meorln") = 0.75
Since the program has no further way of deciding between the two, it just replaces "Meorln" with None.
Example 1
The first example applies the clean_sales_data function to the input data contained in sales_data_noisy_sample.csv. Notice the use of Python's pprint function in this example, which is like print but 'beautifies' the output, so it is easier to read. This is particularly useful to visualise dictionaries.
'CouncilArea': 'Whittlesea',
'Landsize': '257',
'Latitude': '-37.65636',
'Longitude': '145.03997',
'Postcode': '3082',
'Price': '421000',
'Regionname': 'Northern Metropolitan',
'SaleDate': None,
'Seller': 'Ray',
'Suburb': 'Mill Park'},
'7': {'Address': '2/4 Currajong St',
'Bathrooms': '1',
'Bedrooms': '3',
'CarSpaces': '1',
'CouncilArea': 'Whittlesea',
'Landsize': '340',
'Latitude': '-37.67803',
'Longitude': '145.01769',
'Postcode': '3074',
'Price': None,
'Regionname': None,
'SaleDate': '24/06/2017',
'Seller': 'Love',
'Suburb': 'Thomastown'},
'8': {'Address': '10/30 Pickett St',
'Bathrooms': '1',
'Bedrooms': '1',
'CarSpaces': '0',
'CouncilArea': 'Maribyrnong',
'Landsize': '30',
'Latitude': '-37.80141',
'Longitude': '144.89587',
'Postcode': None,
'Price': '170000',
'Regionname': 'Western Metropolitan',
'SaleDate': '1/07/2017',
'Seller': 'Burnham',
'Suburb': 'Footscray'}}
Notice the None values in the nested dictionaries of the cleaned data.
Example 2
>>> data_noisy = read_data('sales_data_noisy_sample2.csv')
>>> pprint(clean_sales_data(data_noisy))
{'1': {'Address': '7/459 Waterdale Rd',
'Bathrooms': '1',
'Bedrooms': '2',
'CarSpaces': '1',
'CouncilArea': 'Banyule',
'Landsize': '90',
'Latitude': '-37.74391',
'Longitude': '145.04669',
'Postcode': '3081',
'Price': None,
'Regionname': 'Eastern Metropolitan',
'SaleDate': '12/08/2017',
'Seller': 'hockingstuart',
'Suburb': 'Heidelberg West'},
'10': {'Address': '147 George St',
'Bathrooms': '1',
'Bedrooms': '3',
'CarSpaces': '0',
'CouncilArea': 'Knox',
'Landsize': '730',
'Latitude': '-37.88919',
You can assume the following:
the input data dictionary does not contain None values.
Be careful!
Use copy.deepcopy() will lose 50% (1 mark) for approach.
list.copy() and dict.copy() are okay to use.
(5 Marks)
Write a function called sale_stats_by_council(data, start_date, end_date) which takes three arguments: a dictionary of clean data in the format returned by read_data, a start date of the form dd/mm/yyyy, and an end date of the form dd/mm/yyyy (both inclusive). You can assume date inputs are valid.
For each council area, the function calculates the following four statistics within the start_date to end_date range:
average sale Price, rounded to the closest whole number.
sum total Price
minimum instance of Price
maximum instance of Price
If the start date is greater than the end date the function should return -1. You can use Python date functionality to compare two dates. Googling will help you to find ways that this can be done.
The result is a dictionary with this information, as shown in the examples below. If a council does not have this information present in the data, then the values for that council are set to None.
Example 1
>>> data_cleaned = read_data("sales_data_clean_sample.csv")
>>> pprint(sale_stats_by_council(data_cleaned, '01/01/2016', '31/12/2017'))
{'Banyule': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Brimbank': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Darebin': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Hume': {'avg': 280000, 'max': 280000, 'min': 280000, 'sum': 280000},
'Knox': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Maribyrnong': {'avg': 170000, 'max': 170000, 'min': 170000, 'sum': 170000},
'Melbourne': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Moonee Valley': {'avg': 905000, 'max': 905000, 'min': 905000, 'sum': 905000},
'Moreland': {'avg': 675000, 'max': 675000, 'min': 675000, 'sum': 675000},
'Whittlesea': {'avg': 440500, 'max': 460000, 'min': 421000, 'sum': 881000},
'Yarra': {'avg': 1068500, 'max': 1102000, 'min': 1035000, 'sum': 2137000}}
Example 2
>>> data_cleaned = read_data("sales_data_clean.csv")
>>> pprint(sale_stats_by_council(data_cleaned, '01/01/2016', '31/12/2017'))
{'Banyule': {'avg': 957879, 'max': 3300000, 'min': 420000, 'sum': 139850400},
'Brimbank': {'avg': 585941, 'max': 840000, 'min': 145000, 'sum': 9961000},
'Darebin': {'avg': 871219, 'max': 2300000, 'min': 385000, 'sum': 27879000},
'Hume': {'avg': 563995, 'max': 1170000, 'min': 280000, 'sum': 62603500},
'Knox': {'avg': 892949, 'max': 2020000, 'min': 450000, 'sum': 58041688},
'Maribyrnong': {'avg': 906626,
'max': 2450000,
'min': 170000,
'sum': 115141500},
'Melbourne': {'avg': 1058382, 'max': 2370000, 'min': 370000, 'sum': 17992500},
'Moonee Valley': {'avg': 1072271,
'max': 2905000,
'min': 280000,
'sum': 277718200},
'Moreland': {'avg': 1016347, 'max': 2500000, 'min': 145000, 'sum': 193106000},
'Whittlesea': {'avg': 630756, 'max': 1405000, 'min': 345000, 'sum': 105336300},
'Yarra': {'avg': 1269376, 'max': 4011000, 'min': 345000, 'sum': 276724000}}
Example 3
>>> data_cleaned = read_data("sales_data_clean.csv")
>>> pprint(sale_stats_by_council(data_cleaned, '01/12/2016', '31/12/2016'))
{'Banyule': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Brimbank': {'avg': 720000, 'max': 720000, 'min': 720000, 'sum': 720000},
'Darebin': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Hume': {'avg': 411500, 'max': 455000, 'min': 368000, 'sum': 823000},
'Knox': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Maribyrnong': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Melbourne': {'avg': 735000, 'max': 735000, 'min': 735000, 'sum': 735000},
'Moonee Valley': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Moreland': {'avg': 1050562, 'max': 1900000, 'min': 400000, 'sum': 16809000},
'Whittlesea': {'avg': None, 'max': None, 'min': None, 'sum': None},
'Yarra': {'avg': 1305583, 'max': 3335000, 'min': 370000, 'sum': 31334000}}