Requirement
CS1210 Computer Science I: Foundations
Project 1: Educational Data
Part 1 due Friday, November 18 at 11:59PM
Important
CSI projects represent individual work; there are no partnerships. Everyone is responsible for their own
work. Under no circumstances should you turn in the work of someone else (including any code or
material you may find on the Internet) as your own, nor should you share your own work with others.
You may discuss general concepts with your classmates as long as these discussions do not lead to the
actual exchange of code fragments or written solutions.
You have several sources for help. Your first recourse should always be to post your question on the ICON
discussion board. This is the fastest place to go for clarifications or disambiguation, or for help with
Python in general (remember, don’t post your solution or any part of your solution). Second, if you must
share a portion of your code, you can always attend a TA help session (see the announcement on ICON
entitled General TA Office Hours start Thursday, September 1 for times and locations). You may also
attend my office hours Tuesdays from 1:30 to 3:30 in my office (14 MLH). Finally, you may email your
code with a specific question to me or to your TA, but please be sure that you include CS1210 on the
subject line (also, this is the slowest way of getting help, due to the volume of mail we are likely to
receive).
Introduction
For this two-part project you will be working with a medium-size dataset (about 800K records) of
education statistical indicators obtained from the World Bank. The dataset consists of two files:
f308b71c-00be-4519-9b0c-0d3100b75092_Data.csv
f308b71c-00be-4519-9b0c-0d3100b75092_Definition and Source.csv
both in csv or comma-separated value format. The first file contains the data, with each record having five
fields, described in the first line of this file as:
Country,Country Code,Series Code,2015 [YR2015]
The next 762,000 lines contain records of this form, while the last five lines look like:
,,,,
,,,,
,,,,
Data from database: Education Statistics - All Indicators,,,,
Last Updated: 10/04/2016,,,,
and can safely be ignored (these lines are either blank — hence the rows of commas — or contain data
provenance information).
The second file contains information about the data contained in the first file. Each line is a record
consisting of four fields, described in the first line of this file as:
Code,Indicator Name,Long definition,Source
Each of these lines can be quite long, especially (and not surprisingly) the field called “Long definition.”
Using the ellipsis (“…”) to shorten the line, a sample record (the 10th line of the file to be precise) looks
like:
UIS.NERA.3,"Adjusted...","Total...",UNESCO Institute for Statistics
1 Revised November 14, 2016
Aside from the textual descriptions, the most interesting field here is the first, which by design contains
interesting information about the type of data represented, and which also appears as the third field in
the data file. The key idea is that records in the data file that share the same “Series Code” (or simply
“Code” as it’s called in the definitions file) are by definition directly comparable. So, for example, if I
am interested in the adjusted net enrollment rate for females in primary school (SE.PRM.TENR.FE)
across different countries or populations, I could compare all the records from the data file with this
“Series Code” (there are 254 of these).
Reading Data
Your first task is to read these data into Python and construct an appropriate representation of these
records. You will write two functions to achieve this end.
First, write a function
def readDefinitions(filename):
which opens file filename of the second type described above, that is:
Code,Indicator Name,Long definition,Source
and returns a dictionary D with entries of the form:
Code: [Indicator Name, Long definition, Source]
Reading csv files can be tricky, because it is quite possible that some fields may contain explicit commas
embedded in a single field (see, e.g., the “Long definition” field). In such cases, these extraneous commas
are protected by enclosing the field in quotes. For example, consider the following line taken from a
hypothetical csv file representing Olympic gold medalists:
Fencing,"Garozzo, Daniele",ITA,2016
Here, the correct interpretation is a single record (or line) containing four fields (and not five) because the
comma in the gold medalist’s name is not semantically equivalent to the other three commas in the line.
Fortunately, because this is such a common issue, Python provides a library for reading csv files while
honoring the semantics of commas embedded in quoted fields. To use the library, you must first:
import csv
You will need to read up on how to use the csv library; more information can be found here:
https://docs.python.org/3.5/library/csv.html
Important: do not attempt to split() the line, as I can guarantee this will fail. To be successful, you will
need to use the csv library. Also, be wary of the first line in the csv file, which is a “header” and should
not be included in the dictionary as data.
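To see why split() is not enough, here is a minimal sketch that parses the Olympic example line both ways. The header row in the sample (Sport, Gold medalist, Country, Year) is invented for illustration, and the text is read from an in-memory string rather than from a file; with a real file you would pass the open file object to csv.reader instead.

```python
import csv
import io

# Hypothetical two-line csv: an invented header row, then the fencing example.
sample = 'Sport,Gold medalist,Country,Year\nFencing,"Garozzo, Daniele",ITA,2016\n'

# Naive approach: splitting on commas breaks the quoted name into two pieces.
naive = sample.splitlines()[1].split(',')

# csv.reader honors the quotes, so the name stays in one field.
reader = csv.reader(io.StringIO(sample))
header = next(reader)   # consume the header so it is not treated as data
records = list(reader)  # the remaining lines, each one a list of fields

print(len(naive))   # 5 pieces from split()
print(records[0])   # 4 fields from csv.reader
```

Calling next() once on the reader before looping is one common way to skip a header line; other approaches work equally well.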
The second function you should write:
def readData(filename):
should open a file filename of the first type above, that is:
Country,Country Code,Series Code,2015 [YR2015]
and return a tuple of two dictionaries, C and V. The first dictionary, C, should have entries of the form:
Country Code: Country
and therefore should look like, e.g.,
{'USA': 'United States', 'ITA': 'Italy', 'PRT': 'Portugal', ...}
while the second dictionary, V, in the tuple should have entries of the form:
Series Code: {Country Code: 2015 [YR2015], ...}
and thus look like, e.g.,
{'UIS.LR.AG15T99.GPI': {'ARE': '1.02945005893707', ...}, ...}
Again, be wary of the fact that the first line in this csv file is also a “header” and should not be included in
the dictionary as data, nor should the last five lines of this file, which represent summary information
and not real data.
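The shape of C and V may be easier to picture with a small sketch. The rows below are hypothetical, already-parsed records in the order (Country, Country Code, Series Code, value); the numeric values are made up and are not taken from the dataset. The helper name build_tables is also invented here. The check against '..' reflects the missing-value marker this handout describes for the data file.

```python
# Hypothetical parsed records; the values are invented for illustration.
rows = [
    ("United States", "USA", "SE.PRM.TENR.FE", "93.1"),
    ("Italy", "ITA", "SE.PRM.TENR.FE", "97.4"),
    ("Italy", "ITA", "UIS.LR.AG15T99.GPI", ".."),   # missing value
    ("Portugal", "PRT", "UIS.LR.AG15T99.GPI", "0.97"),
]

def build_tables(records):
    """Return (C, V) built from (country, code, series, value) records."""
    C = {}  # country code -> country name
    V = {}  # series code -> {country code -> value}
    for country, code, series, value in records:
        C[code] = country
        if value != "..":
            # setdefault creates the inner dictionary the first time
            # a series code is seen, then returns it for the update.
            V.setdefault(series, {})[code] = value
    return C, V

C, V = build_tables(rows)
print(C)
print(V)
```

Note that a country code still lands in C even when all of its values are missing; only V is filtered in this sketch.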
In constructing the readData() function, it is important not to include entries where the corresponding
value is missing (missing values in this data file are indicated by a string, '..'); there will be many fewer
values than the 762,000 rows in the data file might imply. To get to the heart of the matter, you will write
a third function:
def makeProfiles(C, V):
which will return a dictionary, P, indexed by country code with values consisting of the number of
datapoints present in V for that particular country. Thus part of P produced by your first implementation
of makeProfiles() might look like:
{'MAR': 307, 'IBT': 0, 'WSM': 293, 'URY': 269, ...}
indicating that there is no data about 'IBT' present in this dataset. You will note that 'IBT' is defined in
C as 'IDA IBRD total' rather than some country name; IDA and IBRD stand for International
Development Association and International Bank for Reconstruction and Development, respectively, two
branches of the World Bank from which these data were obtained. Given that no data is provided, your
complete version of makeProfiles() should alter C to remove the entry for IBT and any other similar code
that has no data associated with it. Thus the value of P produced by your final version of makeProfiles(),
which might look like:
{'SMR': 14, 'UMC': 28, 'ARG': 263, ...}
should not contain any entries with 0 values, and C should be modified to remove those entries as well. If your
code is like mine, final versions of both P and C should contain 241 entries.
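The counting and pruning described above can be sketched on a toy example. The function name count_datapoints, the series codes S1 and S2, and the values are all invented for illustration; this is one possible approach under those assumptions, not the reference solution.

```python
def count_datapoints(C, V):
    """Count values per country code, then drop codes that have none.

    C is modified in place: codes with zero datapoints are removed.
    """
    P = {code: 0 for code in C}            # start every known code at zero
    for by_country in V.values():          # one inner dictionary per series
        for code in by_country:
            P[code] = P.get(code, 0) + 1
    # Collect the empty codes first; deleting while iterating over P
    # directly would raise a RuntimeError.
    empty = [code for code, n in P.items() if n == 0]
    for code in empty:
        del P[code]
        del C[code]
    return P

# Toy inputs: 'IBT' appears in C but has no values anywhere in V.
C = {"USA": "United States", "ITA": "Italy", "IBT": "IDA IBRD total"}
V = {"S1": {"USA": "1.0", "ITA": "2.0"}, "S2": {"USA": "3.0"}}

P = count_datapoints(C, V)
print(P)
print(C)
```

After the call, P and C have the same set of keys, which matches the expectation that both end up with the same number of entries.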
Finally, you will implement a plotting function:
def plotProfile(P):
that will reproduce the figure shown here below.
Note how all of the country codes are arranged alphabetically along the x axis, with the y values
corresponding to the number of datapoints for that particular country code. You will need to study the
matplotlib pyplot documentation:
http://matplotlib.org/api/pyplot_api.html
to learn how to make your code work.
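As a starting point, here is a rough sketch of the general pattern: sort the country codes, pair them with their counts, and hand both lists to pyplot. The figure itself is not reproduced in this text, so the bar chart, the axis labels, and the function names profile_axes and sketch_plot are assumptions; your plot should match the actual figure, not this sketch.

```python
def profile_axes(P):
    """Return (codes, counts) with codes in alphabetical order."""
    codes = sorted(P)
    counts = [P[code] for code in codes]
    return codes, counts

def sketch_plot(P):
    """Draw one bar per country code (chart type is an assumption)."""
    # Imported here so the helper above works without matplotlib installed.
    import matplotlib.pyplot as plt

    codes, counts = profile_axes(P)
    positions = range(len(codes))
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.bar(positions, counts)
    ax.set_xticks(positions)
    ax.set_xticklabels(codes, rotation=90, fontsize=5)
    ax.set_xlabel("Country code")           # assumed label
    ax.set_ylabel("Number of datapoints")   # assumed label
    fig.tight_layout()
    plt.show()
```

Calling sketch_plot({'MAR': 307, 'WSM': 293, 'URY': 269}) would open a window with three bars ordered MAR, URY, WSM.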
Once this code is complete, we’ll turn our attention to a more interesting analysis of the data.