The goal of this document is to give you advanced tools for working with
database queries. By now you should already know how to connect to a database
and how to make some nice plots. You will be expected to use e cient queries on
exams. In other words the strategies used in the rst SQL lab generally brought
the entire database into R memory and then performed calculations/actions.
Last week and beyond we are working on extracting just what we need and only
importing the deisred rows and columns into R.
You need to submit the solutions to all of the lab questions by Monday
at 9am. It is recommended that you use R Markdown for working through
assignments. Include code and output for all questions. Just make sure that
you format the output so as to make it possible for the marker to understand
what you are doing. Stray pieces of output that the marker needs to hunt for
won’t make the marker happy. Make sure that you clearly present your results.
1 Connecting to a Database
Connect to the "stat240Week5lab.sqlite" database. Then print the list of tables
in the database just for your reference. Remember to call the usual database
and SQL libraries. You can obtain table information using PRAGMA as we did
last week. This week we have two Olympic datasets, Summer and Winter. In
the Summer table NOC is a 3 letter country code, see https://en.wikipedia.
org/wiki/List_of_IOC_country_codes.
2 Moving Averages with SQL
In statistics, a moving average (rolling average or running average) is a calcula-
tion to analyze data points by creating series of averages of di erent subsets of
the full data set. Usually the mean is taken from an equal number of data on
either side of a central value. Sometimes we use the average of the previous few
years since we do not yet have information moving forward. Let’s look at how
to obtain moving with SQL Queries.
Lets try to obtain moving averages for total number of athletes who obtained
medals regardless of the country. For that rst we need to create a Frequency
table where we need a unique list of year with its corresponding total medal
counts. To do this we will create a new virtual table as an interim step. We’ll
split up the task into small pieces so that you can see what is happening.
Obtain the medal count within each year.
1
mov_avg1 = "SELECT year, Count(year) AS TotalNumber FROM
WinterO GROUP BY year"
out = dbGetQuery(con, mov_avg1)
Question 1a Make a nice plot of the number of athletes who obtained
medals per year. Use the option type="p". Make sure axes and the title
are nice and clear (total 5 points: 4 points for the plot + 1 point for your
line of code).
Create a virtual table (called a VIEW ) which we will call tot meds to
house that output from the previous query without bringing the results
into R. This will let us do other queries on that virtual table.
mov_avg2 = "CREATE VIEW tot_meds AS SELECT year,
Count(year) AS TotalNumber FROM WinterO GROUP BY year"
dbSendQuery(con, mov_avg2)
Question 1b Check that the above worked by nding the new virtual
table inside the database (1 point for code and console output).
Do this to delete the temporary table so that you don’t get stuck with
errors when you rerun your code. If you run this you will need to re-run
the previous step to re-build the virtual table.
dbSendQuery(con, "drop view tot_meds")
Query the virtual table to extract year and total medal.
mov_avg3 = "SELECT year, TotalNumber FROM tot_meds"
out = dbGetQuery(con, mov_avg3)
Question 1c Add a line plot of the total number of athletes who received
medals per year from this table as an addition to the plot in part 1a. Make
sure you chose a di erent colour for the line. (total 2 points. 1 point for
the line, 1 point for the code)
Now we will compute a moving average by de ning an index within the
year and averaging everything within that window. This requires us to
de ne two new tables t1 and t2. Remember that we can specify where
a variable comes from by using ".". So t1.year means that we use the
variable year from table t1.
We will use table t1 to de ne an index over year and use that index to
average everything within the window:
WHERE t2.Year BETWEEN (t1.Year-4) AND (t1.Year+4) .
Note that the moving average will contain more or less Olympics within
the window depending on the year. The reason is that the Olympics were
occasionally cancelled due to world wars.
2
check = "SELECT * FROM tot_meds AS t,
(SELECT t1.year, AVG(t2.TotalNumber) AS mavg
FROM tot_meds AS t1, tot_meds AS t2
WHERE t2.year BETWEEN (t1.year-4) AND (t1.year+4)
GROUP BY t1.Year) sq WHERE (t.year = sq.year)"
movingAvg = dbGetQuery(con, check)
Question 1d Put a line with the moving average on the gure
and make a legend(total 4 points; 2 points for the line, 2 points
for the legend).
Now, make sure that you delete the temporary table.
3 The number of medals won by team China in
2004
Now let’s look at the number of medals won by team China in the summer
Olympics.
meds = "SELECT Edition as Year, Count(Edition) AS TotalNumber FROM
Olymp_meds GROUP BY NOC, Edition
HAVING NOC == CHN AND Year == 2004"
out = dbGetQuery(con, meds)
Question 2a, 2 points; max 1 sentence Why does our number dif-
fer from the one on https://en.wikipedia.org/wiki/China_at_the_2004_
Summer_Olympics? The biggest danger to having a lot of di erent techniques
with little understanding of what they are actually doing is that it is easy to
get an answer but hard to know if your answer is correct. In assignments and in
exams, make sure you know if your answer is correct - or at least how to check
your work.
Question 2b, 6 points (5 for code, 1 for output proving you have
the right answer) Make a SQL query that will provide one row for each medal
won in the 2004 Olympics by team China. (Hint: Make sure that you get the
right number of each medal type when comparing to https://en.wikipedia.
org/wiki/China_at_the_2004_Summer_Olympics.)
Question 3 a) Create a plot of total number of medal events for
a country of your choosing (that country must participate regularly
in the Olympics going back to 1948 or earlier and earn at least 1
medal per summer or winter Olympics. Be careful about using NOC
or Country.) Add a line with the moving average of the current and
past 12 years, so that we are not averaging into the future. Add
another line with the moving average of with a window containing
the past and next 4 years. Finally include a horizontal like with the
average number of medals for that country across all years. Include
the usual things to make the plot readable and (obviously) make sure
that we know what is in the plot (total 8 points.)
3
4 SQL summary statistics
The objective here is to Extract Summary Statistics from the values in the table
instead of bringing the table into R and then producing summary statistics.
First the obvious summaries by making a VIEW with data from team
Canada. We will consider only the number of athletes who received medals.
Some summaries can be obtained directly from SQLite:
Canadian = "CREATE VIEW Can_tot_meds AS SELECT Edition AS Year,
Count(Edition) AS TotalNumber FROM Olymp_meds GROUP BY NOC,
Edition HAVING NOC == CAN "
dbGetQuery(con, Canadian)
# dbGetQuery(con, drop view
# Can_tot_meds )
summaries = "SELECT COUNT(Year) AS YearsInOlympics,
AVG(TotalNumber) AS AVGmedalcount , MIN(TotalNumber) AS MINmedalcount ,
MAX(TotalNumber) AS MAXmedalcount FROM Can_tot_meds"
(out = dbGetQuery(con, summaries))
Now to get the median. Recall that to get the median we order all the values
and take the one in the middle (1=2 of the way through the sorted values). We
actually do that by using LIMIT to limit the results to give us a single value
and then asking for a speci c ordered value using OFFSET to de ne the value
in the middle of the sorted values:
getmedian = "SELECT TotalNumber AS Median FROM Can_tot_meds
ORDER BY TotalNumber LIMIT 1 OFFSET (SELECT COUNT(TotalNumber)
FROM Can_tot_meds) /2"
(out = dbGetQuery(con, getmedian))
Note that SQL expects the OFFSET to be an integer and therefore expects
(except in very speci c exceptions) that the OFFSET be determined by using
integers and not decimals. So basically dividing by 10 works but multiplying by
.1 doesn’t.
Question 4 (5 points for code, 5 points for numbers)
Use SQL to obtain the (10;30;50;70;90)th percentiles of the total number of
athletes who won medals earned over all countries with medals (i.e. all countries
in the table ’WinterO’) for the years when the Olympics were in Lake Placid
(Hint, this is easiest if you split this into 5 queries).
5 Doing ’Fancy’ Math Within RSQLite
The function initExtension, loads some math extensions into the database so
that we can get SQL to do mathematical queries. For example the Standard
Deviation (stdev) is included in those extensions.
initExtension(con)
dbGetQuery(con, "SELECT STDEV(TotalNumber) FROM Can_tot_meds")
Find all of the functions this allows us to use by looking at the help le
(within R) for initExtension.
Let’s Standardize the populations of Canadian regions. To obtain a stan-
dardized score we wish to look for deviations from typical values that scaled by
within the usual variation.
Z-Score(x) = x xpvar(x)
We can obtain the standardized values of the Canadian populations within
each region but we need to do this calculation on every row of data. We could use
a VIEW table. Or we could de ne a temporary table within the SQL argument.
We do that here using WITH to de ne the temporary table pop_cnt. Then we
call a variable within that table using the notation pop_cnt.variable. You can
check partial results by running the SELECT command within the brackets.
That will show you what you will get as the temporary table pop_cnt.
sql_z = "WITH pop_cnt AS
(SELECT avg(Population__2011) AS mean,
stdev(Population__2011) AS sd FROM POP)
SELECT Geographic_name, POP.Population__2011,
pop_cnt.mean, pop_cnt.sd,
(POP.Population__2011 - pop_cnt.mean) / pop_cnt.sd AS z_score_pop
FROM pop_cnt, POP"
out = dbGetQuery(con, sql_z)
Note that the rst value of the POP table is the population of Canada. We
can use OFFSET to skip ahead to the second value. The problem is that LIMIT
must be used with OFFSET. LIMIT returns a limited number of rows, but we
want all of them. Remember how we obtained all remaining rows when using
dbfetch? Use that trick to x the code below which otherwise calculates the
standardized values.
Question 5, total 5 points (2 points for making a histogram of the
standardized populations and including a nice title and axis label. 3
points for the SQL code. Hint: Make sure you look at the rst few
values in the table to make sure you have the population within small
regions and not something much larger. If there are any rows you
need to exclude, for full points your SQL code should do that for you.