The goal of this document is to walk you through more RSQLite code and
work with Databases in R. This document will show you how to handle SQL in
a more realistic setting.
You need to submit the solutions to all of the lab questions by Monday
at 9am. It is recommended that you use R Markdown for working through
assignments. Include code and output for all questions. Just make sure that
you format the output so as to make it possible for the marker to understand
what you are doing. Stray pieces of output that the marker needs to hunt for
won’t make the marker happy. Make sure that you clearly present your results.
1 Reading Olympic Dataset into R
On an exam, and in real life, you’ll be given a database and will be expected
to know how to connect to it and which libraries to use, how to access the
pieces,. . . For this week we’ll give you some code that you should already know
and a table which is already set up. Download the le "stat240Week4lab.sqlite"
and connect to the database.
library(RSQLite)
library(DBI)
dbcon = dbConnect(SQLite(), dbname = "stat240Week4lab.sqlite")
# Find the names of the tables in the
# stat 240 database:
# dbListTables(dbcon) Some of them are
# from last week.
Let’s get some info about the new table! What we did last week actually
loads the entire database into R and runs code on that. This week we are
using more sophisticated code that performs a query but doesn’t load the entire
database into local memory. When the data le is realistically large, you will not
be able to load the entire table into R and will need to use a smarter work ow.
We’ll use PRAGMA to get some table info but note that unlike most other
queries we’ve used, PRAGMA is speci c to SQLite. MYSQL and others have
di erent ways of doing this. Let’s start with exploring the table: WinterO.
1
query_table_info = "PRAGMA table_info( TableNameGoesHere )"
dbGetQuery(dbcon, query_table_info)
PRAGMA gives a data frame. with one row for each column of the database
table. The data frams includes the column id, the column name, the ype of data
(integer, text,...), and the notnull column states if the column has any NULL
values. dflt_value gives the default value if there is one.
2 SQL Query Distinct
In a table, a column may contain many duplicate values and sometimes you
only want to list the di erent/distinct values. Here that could imply extracting
just unique Olympic winners or the distinct years of Olympics in the table. The
DISTINCT keyword can be used to return only distinct/unique values from a
SQL database.
Let’s say we want to know the unique years in the dataset. For that purpose
we could use the following SQL Query.
sql_dstc = "SELECT DISTINCT year, place FROM WinterO"
3 dbGetQuery
The next challenge is communicating with the SQL database using R com-
mands/functions. For this pupose we could the function dbGetQuery as we did
last week. This function sends a SQL query, retrieves results and then clears
the result set. This function is for SELECT queries only.
dbGetQuery(dbcon, sql_dstc)
4 SQL Query Order By
We have several unique places in the table. But the locations and years may
not be as clear. To clarify we could use the command ORDER BY to sort the
result-set by one or more columns. ORDER BY variablename sorts the records
in ascending order by default. To sort the records in a descending order, you
can use the option DESC. The syntax then becomes: ORDER BY variablename
DESC or ORDER BY variablename1, variablename2,... DESC
sql_ord = "SELECT DISTINCT year,place FROM WinterO ORDER BY place DESC"
dbGetQuery(dbcon, sql_ord)
Some cities (Innsbruck, Lake Placid, St. Moritz) appears multiple times
since they hosted the Olympics more than once each. The above command asks
for DISTINCT combinations of year and place.
2
Question 1 (total 4 points)
a) Print the unique set of Countries of Olympic athletes from Olympic
Medals table in the database and sort them from z to a.[2 points]
b) Print the sorted unique countries of athletes whose last names begin with
the rst 2 letters of your last name. If you end up with 0 matches, try with
fewer letters or some other fragment of your name. Sort the output by country
(Recall that SQL is invariant to upper and lower case letters) [2 points] (hint:
print the athletes names as well to make sure you are obtaining what you think
you should be obtaining.)
5 dbSendQuery and dbFetch
The function dbSendQuery submits and executes the SQL query to the database
engine, but it does not extract any records. For that you need to use the func-
tion dbFetch, and then you must call dbClearResult to clear the results when
you nish fetching the records you need.
On the other hand, dbFetch fetches the next n elements (rows) from the
result set and returns them as a data.frame. This makes it easier for you to
extract results if all you want to do is print the rst ’n’ results because it avoids
the need for loading all of the results into R and then doing head or tail. You
should know these types of tricks for getting results without loading the entire
table into R - especially when working in real life (and for exams).
From the population and CA tables if we wanted to select all populations of
regions in Saskatchewan in 2011 this would become:
sql_pop = "SELECT Population__2011, Region FROM CA INNER JOIN
POP2011 ON CA.Geographic_name=POP2011.Geographic_name WHERE
province == Saskatchewan "
QuerryOut = dbSendQuery(dbcon, sql_pop)
QuerryOut
Notice that QuerryOut describes results but doesn’t actually give results.
That’s where we need to use dbFetch.
dbFetch(QuerryOut, 5)
Running the last command again will give the next rows in the set:
dbFetch(QuerryOut, 5)
To get all remaining rows set the number to -1:
dbFetch(QuerryOut, -1)
Use dbClearResult to clean up afterwards otherwise you could end up with
a lot of memory clogged with old querries.
3
dbClearResult(QuerryOut)
Question 2 (10 points)
a) The Exch table contains information about SFU’s exchange partners in the
2017/2018 academic year. Write a query to extract and count the number of
times each country appears. Save those results in R so that they can be reused
in part b [4 points]
b) Find the countries from part A that have exactly 3 exchange school part-
ners (1 point).
c) Write a query to obtain all of the events, athletes and years for medal
winners from the countries in part b and order them by the event name. Save
the results of your query in an object. Use dbFetch three times in a row, each
time retrieving another set of two names and print them. Finally retreive all
remaining values. (submit your SQL query and the console output). (hint: It
would be wise to also obtain the country in your querry to ensure a correct
answer.)
Again, use "dbClearResult()" to clean up afterwards otherwise you could
end up with a lot of memory clogged with old querries.
dbClearResult(athls_names)
6 SQL Query INSERT INTO
The INSERT INTO statement is used to insert new records in a table.
Assume that new regions called Statsville and Statsville240 have recently
been founded on top of a mountain in British Columbia and that we wish to
add it to our CA table. We need to ll in all the values for the row.
sql_ins = "INSERT INTO CA
( Country, Geographic_name, Region, Province, Prov_acr,
Latitude, Longitude)
VALUES ( CA , V5A , Statsville , British Columbia , BC ,
49.278417 , -122.916454 ),
( CA , V5A , Statsville240 , British Columbia , BC ,
49.278417 , -122.916454 )"
dbSendQuery(dbcon, sql_ins)
Note that the output will tell you how many rows have been changed. To
see them you will need to use the extraction querries from before. Note that
we didn’t input values into all the columns. Those columns are lled with the
default value (as per whatever we found when using PRAGMA last.)
sql_ins3 = "SELECT * FROM CA WHERE Region LIKE Statsville% "
dbGetQuery(dbcon, sql_ins3)
Question 3 (14 points)
4
Denny Morrison is a Canadian Speed skating athlete. In 2014, he won
some medals for Team Canada. His events and results are here: https://
en.wikipedia.org/wiki/Denny_Morrison.
a) Add Denny Morrison’s 2014 Medal events to the table. Then obtain all
Entries for Denny Morrison. ( 4 points for the SQL code, 2 points for output;
total 6 points) Hint: It may help to look up another Canadian speed skating
legend, Gaetan Boucher in the SQL table because he won in other years in those
same events.
b) Let’s pretend that you won Olympic Medals. Enter yourself in the table
three times with di erent Olympic Medals for di erent events of your choosing.
Set the years to 1988, 2010, and 2014 (6 points) Submit your SQL code. If you
actually won a medal(s), rst check if you are in the table and if not then put
in your actual result(s) instead of making them up.
c) Now print and show the records from the Olympic Medals table for all
athletes who share your last name (2 points for SQL code and output).
7 SQL Query DELETE
The DELETE statement is used to delete records in a table.
Let’s say I want to delete the Statsville entry I inserted to the table.
sql_del = "DELETE FROM CA WHERE Region == Statsville "
dbGetQuery(dbcon, sql_del)
Now when you retrieve the results you will see that Statsville240 remains:
sql_ins3 = "SELECT * FROM CA WHERE Region LIKE Statsville% "
dbGetQuery(dbcon, sql_ins3)
Question 4 (6 points )
a) Delete your 2010 entry in the Olympic Medals data table. (3 points for
SQL code and R console output)
sql_del2 = "DELETE FROM WinterO WHERE year== 2014 AND Athlete == Dave Campbell "
dbGetQuery(dbcon, sql_del2)
## Warning in rsqlite fetch(res@ptr, n = n): Don’t need to call dbFetch()
for statements, only for queries
## data frame. with 0 columns and 0 rows
b) Now print the records from the table that has your name in the Athlete
column. (3 points for SQL code and R console output)
sql_del3 = "SELECT * FROM WinterO WHERE Athlete == Dave Campbell "
dbGetQuery(dbcon, sql_del3)
## [1] place year Country Medal Event
## [6] Athlete Record
## (or 0-length row.names)
5
8 SQL Functions
Let’s focuss on some other SQL functions that could be useful to extract sum-
maries from the tables.
GROUP BY: The GROUP BY statement is used in conjunction with the
aggregate functions to group the result-set by one or more columns.
COUNT: The count function returns the number of rows that matches a
speci ed criteria.
SUM: The SUM function returns the total sum of a numeric column.
HAVING: The HAVING clause was added to SQL because the WHERE
keyword could not be used with aggregate functions, treat HAVING in a
aggregation querry as you would WHERE in a SELECT query.
Lets see how many data records are there for each year in the data set. We
will name the new column of the data frame. "TotalNumber"
# Try this:
sql_count = "SELECT year, Count(year) AS TotalNumber FROM WinterO"
(count = dbGetQuery(dbcon, sql_count))
# See how that doesn t do what we
# want?
# And now Group by year:
sql_count = "SELECT year, Count(year) AS TotalNumber FROM WinterO
GROUP BY year"
(count = dbGetQuery(dbcon, sql_count))
# Plot the results so that it is
# easier to see
plot(count$year, count$TotalNumber, xlab = "Year",
ylab = "Count", type = "b", main = "Total Medal Count per Year")
Lets try to nd the total number of medal athletes for each event in the
’Event’ column per Olympic Games and name the new output column Tots.
sql_sum = "SELECT Event, place,year,count(Event) AS Tots FROM WinterO
GROUP BY Event, year"
sum = dbGetQuery(dbcon, sql_sum)
sum
We can do the same by restrict ourselves to the current millenium:
sql_sum = "SELECT Event, place,year,count(Event) AS Tots FROM WinterO
GROUP BY Event, year HAVING year>=2000"
sum = dbGetQuery(dbcon, sql_sum)
sum
6
Question 5
a) Find the population of each province from the 2006 census. (4 points for
SQL code and console output.)
b) Produce a density plot of the 2011 Total private dwellings counts for all
postal code regions in Ontario. (2 points for query, 1 for plot, 3 for labels: total
5points)
b) Produce a new density plot for the 2011 Population counts for all postal
codes. Your plot should have one line for each province and territory in Canada.
(2 points for query, 2 for plot, 3 for labels, 2 points for legend: total 9points)
hint: it may look nicer if you use the Prov acr rather than the full province
name in the legend.
Finally disconnect from the Database.
dbDisconnect(dbcon)