3 Coursework 3, overview
In this final piece of coursework, you will be working with a chunk of real-world data, namely
a collection of approximately 100,000 tweets harvested from Twitter (provided on Moodle in a
database-ready format). The data is provided as a collection of files used to create a "single giant
table" representation of the data, i.e., the scripts will create a single table with approximately
100,000 rows and a large number of attributes. These attributes will contain all the data about
the tweets, but the table will be very poorly normalised and will be an inappropriate design for
data storage and usage.
Your task in this coursework is to develop a more appropriate design for the database, write
SQL scripts to copy the data from the single table to the tables in your design, and construct
queries to answer some questions about the data set using your new tables.
In overview, the task can be broken down into the following steps.
1. Download the file "thetables.zip" from Moodle, and use it to create and populate the
"giant table" representation of the data.
2. Investigate the data, and consult the Twitter documentation, to understand its structure
and the meaning of its fields.
3. Document the flaws in the design, including but not limited to its poor normalisation.
Where appropriate, this should ideally be done using the language of the theory of
database design, i.e., stating FDs (functional dependencies) considered to be present in
the data and showing that the giant table is not well-normalised (not in 3NF).
4. Document your new design for the data. Your new design should be in 3NF and should
address the design issues from the giant table that you identified above.
5. Write SQL scripts to create your new tables and to appropriately copy data from the giant
table into your new tables.
6. Explore the data using SQL queries, and write an SQL script that answers the questions
outlined at the end of this document.
Your submission should include all the SQL scripts created above, as well as a PDF file describing
the design issues, your new design, query results, and any other conclusions you may have.
These steps are now described in more detail.
3.1 The source data and database initialisation
The data represents a collection of tweets, with attributes essentially according to the Tweet
objects described in the Twitter documentation at https://developer.twitter.com/en/docs/
tweets/data-dictionary/overview/intro-to-tweet-json and subsequent pages. This data
has been stored in the table bad_giant_table as a "flat" giant table construction, where every
row of the table contains all the information relating to one tweet. The majority of attributes
in this table correspond directly to attributes of the Tweet objects described in the above link.
There is also a table bad_shorter_table containing a smaller set of tweets (40 tweets), in
case you want to investigate the data directly and not be overwhelmed by the amount. This
table has the same structure as bad_giant_table.
There are only a few differences between the bad_giant_table data and the standard Tweet
object interface (other than the fact that some attributes are absent from the table):
• Normally, hashtags are stored via lists of hashtag objects. In the table, there are instead
attributes hashtag1 through hashtag6 storing the text of the first six hashtags of each
tweet. (Only a few tweets have more than six hashtags.) Unused hashtag attributes are
null.
• Retweet information is also simplified: if a tweet is a retweet, then the attribute retweet_of_tweet_id
stores the tweet ID of the original tweet that is being retweeted. For tweets that are not
retweets, this attribute is null.
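For instance, to get a feel for this flat representation, you might run a query along the following
lines against the smaller table (tweet_id is a hypothetical column name here; check the actual
schema created by bad-table-create.sql):

-- Hypothetical column name tweet_id: verify against the real schema.
SELECT tweet_id, hashtag1, hashtag2, retweet_of_tweet_id
FROM bad_shorter_table
LIMIT 10;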
3.1.1 Loading the database
To create and populate the tables in PostgreSQL, download the file thetables.zip from Moodle,
copy it to a directory where it is readable from psql, and perform the following steps.
1. Execute the script bad-table-create.sql in your database (as in the tutorial, e.g., using the
\i command at the psql command prompt).
2. To copy the data from the smaller table, type
\COPY bad_shorter_table FROM bad_shorter_table.csv csv
at the psql prompt.
3. To copy the data from the final (larger) table, type
\COPY bad_giant_table FROM bad_giant_table.csv csv
at the psql prompt.
This creates two tables, bad_shorter_table with 40 tweets, and bad_giant_table with roughly
100k tweets. The tables have identical schemas; the shorter table is provided to have an example
that can be investigated directly without being overwhelmed by data.
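Assuming the imports succeed, a quick sanity check is to count the rows of each table at the
psql prompt:

SELECT count(*) FROM bad_shorter_table;  -- should report 40
SELECT count(*) FROM bad_giant_table;    -- should report roughly 100,000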
NOTE: In principle, the table data is stored in a format that should be application independent
(csv). However, it is always possible that there are minor differences in the implementations
between different DBMSs, even for a standard format. Therefore, there is a risk of
inconsistencies if the csv data is imported into a DBMS other than PostgreSQL.
3.2 The database redesign step
There are several issues with the design of the \giant table". It is not normalised, it does not
use constraints other than PRIMARY KEY, and some information is stored in an inappropriate
manner (especially the hashtags).
The tasks in this step of the coursework can be divided into two parts: First, describe the
design flaws of the giant table, including but not limited to the fact that it is poorly normalised;
then design a database schema that addresses these flaws (or argue why some particular flaw
cannot or should not be fixed).
You should produce a document (in PDF format) that documents your conclusions about
any issues with the existing design and describes your redesign.
Functional dependencies and normal forms. For full marks, your description of why the
giant table is poorly normalised must be provided using statements in the form of functional
dependencies among the attributes. As always, these functional dependencies are statements
about the data model more than about the concrete table. Therefore, for such a description,
you will need to make some assumptions about the behaviour of data in the model.
Using these FDs, you will then be able to make formal statements about the normal form
(or lack thereof) of the giant table and of the tables in your own design.
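As an illustration (with hypothetical attribute names; substitute the actual columns of
bad_giant_table), one might posit an FD such as

user_id → user_lang

i.e., a user's language setting is determined by the user's ID. Since one user may author many
tweets, user_id is not a key of the giant table, so this FD makes user_lang transitively
dependent on the tweet's primary key; that is a 3NF violation, and it suggests that user-level
attributes belong in a separate table.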
Finally, although some care has been taken in producing the data set, it is still possible that
in some aspects, the data may be dirty. That is, the giant table may contain a few
inconsistencies, as artefacts of the poor normalisation and of how the data was originally
collected. In practice, this means that, unlike in the examples discussed in class, an FD
statement can be accepted as reasonable, even if it is to some small extent contradicted by the
data. (However, this may raise complications in the copying step described below.)
Your own design should be in 3NF (possibly with some very minor exceptions, which must
then be clearly argued for in your report).
3.3 Creating and populating the new tables
Having designed your new database schema, you are to create the tables and fill them with
data copied from the giant table. The commands for doing this should be placed in an SQL
script as part of your submission. The table creation should also include the usual constraints
(including primary key, foreign key) as appropriate.
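As a minimal sketch of what such a creation script might contain (the tables and columns here
are hypothetical illustrations, not a prescribed design):

-- Hypothetical fragment of a redesign: user-level attributes in
-- their own table, referenced by a tweets table via a foreign key.
CREATE TABLE users (
    user_id   bigint PRIMARY KEY,
    user_lang text
);

CREATE TABLE tweets (
    tweet_id            bigint PRIMARY KEY,
    user_id             bigint NOT NULL REFERENCES users(user_id),
    retweet_of_tweet_id bigint REFERENCES tweets(tweet_id)
);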
For the copying task, it will be very useful to know the INSERT INTO SELECT form of
the INSERT statement in SQL. In brief, a statement such as
INSERT INTO table1 (col1, col2, col3) SELECT col1, col2, col3
FROM table2 [WHERE ...]
will populate table1 with the results of the second query.
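For example, one way to gather the six flat hashtag columns into a single normalised table (a
sketch, assuming a hashtags(tweet_id, hashtag) table in your design and a tweet_id column in
the giant table):

INSERT INTO hashtags (tweet_id, hashtag)
SELECT tweet_id, hashtag1 FROM bad_giant_table WHERE hashtag1 IS NOT NULL
UNION ALL
SELECT tweet_id, hashtag2 FROM bad_giant_table WHERE hashtag2 IS NOT NULL
-- ... and similarly for hashtag3 through hashtag6
;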
As a further hint, you may find a need to avoid inserting data twice. Possibly the easiest
way to do this is to use DBMS-specific features sometimes referred to as upsert (insert or update).
In PostgreSQL 9.5 or later (which is running on the teachdb server), this can be handled by
appending a clause such as ON CONFLICT (attributename) DO NOTHING to the end of the
query, instructing PostgreSQL to skip those insertions which would have caused such a conflict.
Details on this can be found in the PostgreSQL documentation at
https://www.postgresql.org/docs/9.5/static/sql-insert.html, or (less technically) in the
tutorial at http://www.postgresqltutorial.com/postgresql-upsert/.
For a portable version, some situations can be handled by the DISTINCT keyword; others
may require more complex solutions.
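For instance, when copying user data, each user appears once per tweet they authored, so both
of the following (hypothetical column names again) are ways to avoid duplicate rows:

-- Portable: deduplicate in the SELECT itself.
INSERT INTO users (user_id, user_lang)
SELECT DISTINCT user_id, user_lang FROM bad_giant_table;

-- PostgreSQL 9.5+: silently skip rows that would violate the key.
INSERT INTO users (user_id, user_lang)
SELECT user_id, user_lang FROM bad_giant_table
ON CONFLICT (user_id) DO NOTHING;

Note that if the data is dirty in the sense discussed above, SELECT DISTINCT may still yield
two rows with the same user_id but different user_lang values, which only the ON CONFLICT
form tolerates.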
3.4 The queries and reporting
Finally, write SQL queries over your database for the following questions. Attempt to construct
the queries without using hard-coded constants (such as the total number of tweets).
• Tweets, users and languages:
– How many tweets are there in total?
– How are these tweets distributed across languages? Write a query that shows, for
every language (user_lang), the number of tweets in that language.
– Compute, for each language, the fraction of total tweets that have that language
setting, as well as the fraction of the number of users that have that language setting.
Note that these points (per-language) should be computed by a single query (i.e., writing
a separate query for every language is not a valid solution); a sketch of one possible
shape follows below.
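Assuming the hypothetical users and tweets tables from above (in a normalised design,
user_lang most likely lives on a users table, hence the join), such a query might look like:

-- Tweets per language, computed in a single query.
SELECT u.user_lang, count(*) AS tweet_count
FROM tweets t
JOIN users u ON u.user_id = t.user_id
GROUP BY u.user_lang;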
• Retweeting habits:
– What fraction of the tweets are retweets?
– Compute the average number of retweets per tweet.
– What fraction of the tweets are never retweeted?
– What fraction of the tweets are retweeted fewer times than the average number of
retweets (and what does this say about the distribution)?
Again, try to write this without hard-coding any numbers; one possible shape is sketched
below.
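For instance, a subquery can supply the total, so that no constant is hard-coded (again over
the hypothetical tweets table):

-- Fraction of tweets that are retweets.
SELECT count(*)::numeric / (SELECT count(*) FROM tweets) AS retweet_fraction
FROM tweets
WHERE retweet_of_tweet_id IS NOT NULL;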
• Hashtags:
– What is the number of distinct hashtags found in these tweets?
– What are the top ten most popular hashtags, by number of usages?
– Write a query giving, for each language, the top three most popular hashtags in that
language (a sketch using window functions follows this list).
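Window functions are one way to approach the per-language top three; a sketch, assuming the
hypothetical hashtags, tweets, and users tables as above:

-- Rank hashtags within each language by usage count,
-- then keep the three most used per language.
SELECT user_lang, hashtag, uses
FROM (
    SELECT u.user_lang, h.hashtag, count(*) AS uses,
           rank() OVER (PARTITION BY u.user_lang
                        ORDER BY count(*) DESC) AS rnk
    FROM hashtags h
    JOIN tweets t ON t.tweet_id = h.tweet_id
    JOIN users u ON u.user_id = t.user_id
    GROUP BY u.user_lang, h.hashtag
) ranked
WHERE rnk <= 3;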
• Replies:
– How many tweets are neither replies, nor replied to?
– If a user user1 replies to another user user2, what is the probability that they have
the same language setting?
– How does this compare to the probability that two arbitrary users have the same
language setting?
Throughout, you may create views that support your queries.
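For example, a view in the spirit of the hypothetical schema sketched above, which several of
the queries could build on:

-- Each tweet paired with its author's language setting.
CREATE VIEW tweet_langs AS
SELECT t.tweet_id, u.user_lang
FROM tweets t
JOIN users u ON u.user_id = t.user_id;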
IMPORTANT NOTE: All queries must be written over the database that you design. You
are not allowed to use the table bad_giant_table in your queries.
4 Your submission
In summary, your submission must contain the following:
• A plain-text file with SQL instructions for creating the tables of your database design.
• A plain-text file with SQL instructions for copying data from bad_giant_table to your
new tables.
• A report in PDF format that describes the design flaws of the giant table, ideally including
statements about functional dependencies and normal forms, and which documents the
design of your new database schema.
• A plain-text file with SQL instructions for queries over the data.
• Finally, the report should also contain the outputs of the queries (and, if applicable, your
comments on the questions asked, any other conclusions you may have drawn, etc.).
5 Marking criteria
Marks are provided for each step separately, distributed as follows (100 marks total):
• Design issues with the giant table design, including statements about FDs and normal
form: 15 marks.
• A normalised redesign of the above, at least in 3NF, and the description of the design:
15 marks.
• The SQL CREATE TABLE instructions: 10 marks.
• The SQL instructions to copy the data: 20 marks.
• The SQL queries over the new data and conclusions: 40 marks.
