Twitter is a social networking website where users can post very short messages known as "tweets".
Each Twitter user can choose to "follow" other users. A user sees the tweets of the users they are
"following", and their own tweets are seen by their "followers" (the users who follow them).
All the "follow" connections define a network among Twitter users, and it's quite interesting to look for
patterns in the connections. Tools like Twiangulate let you explore questions like "what connections do
my two friends have in common?". In this assignment, you'll write a program that lets you ask
questions (or "queries") about a Twitter dataset.
Any tool for exploring the Twitterverse must get its data from Twitter itself. Twitter provides an API to
allow programmers to write programs that interact with Twitter and extract data from it. In general, an
API is a module that defines functions for accessing underlying data and performing other tasks
without having to know how that data is actually stored and retrieved.
To make this assignment more manageable for you, we will assume that the information we need has
already been extracted from Twitter and stored in a file.
How to tackle this assignment
This is your first experience designing a program of this size. We are providing detailed advice to help
you break the task down into manageable pieces.
Make sure your twitterverse_functions.py module runs without error before submitting. If you have
syntax errors in your module, comment them out before submitting, or we will not be able to test your
functions!
The Twitter Data File
A Twitter data file contains a series of one or more user profiles, one after the other. Each user profile
has the following elements, in this order:
A line containing a non-blank, non-empty username. You may assume that usernames are unique
(a single username will not occur more than once in the file) and that usernames do not contain
any whitespace.
A line for the user's actual name. If they did not provide a name, this line will be blank.
A line for the user's location, or a blank line if they did not provide one.
A line for the URL of a website, or a blank line if they did not provide one.
Zero or more lines for the user's bio, then a line with nothing but the keyword ENDBIO on it. This
marks the end of the bio, and is not considered part of it. (You may assume that no bio has the string
ENDBIO within it.) If the user did not provide a bio, the ENDBIO line will come immediately after the
website line, with no blank line in between.
Zero or more lines each containing the username of someone that this user is following, then a line
with the keyword END on it. (You may assume that no one has END as their username.) A user
cannot be on his or her own following list. You may assume that every user on a following list has a
user profile in the Twitter data file.
Notice that the keywords act as separators in this file. All of their letters are capitalised, and the
keywords contain no punctuation.
Examples
Here is a sample user profile that might occur among many in a file:
tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
The file data.txt is a smallish example of a complete Twitter data file (and was made by hand) and the
file rdata.txt (see starter code) is a much larger example (and is made from real data extracted from
Twitter). These should help you confirm your understanding of the file format and will also be useful in
testing your program.
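The file format above can be read one profile at a time. The following is a minimal sketch, not the required solution: the function name read_profiles and the dictionary keys ('name', 'location', 'web', 'bio', 'following') are hypothetical choices made for illustration, and the second profile in SAMPLE is invented to show how blank lines and an empty bio appear.

```python
import io
from typing import Dict, TextIO


def read_profiles(data_file: TextIO) -> Dict[str, dict]:
    """Return a dict mapping each username in data_file to a profile dict.

    The function name and the profile keys are illustrative assumptions,
    not names required by the assignment.
    """
    profiles = {}
    username = data_file.readline().strip()
    while username != '':
        # The next three lines may each be blank.
        name = data_file.readline().strip()
        location = data_file.readline().strip()
        web = data_file.readline().strip()

        # Bio: zero or more lines, terminated by the keyword ENDBIO.
        bio_lines = []
        line = data_file.readline()
        while line != '' and line.strip() != 'ENDBIO':
            bio_lines.append(line.strip())
            line = data_file.readline()

        # Following list: zero or more usernames, terminated by END.
        following = []
        line = data_file.readline()
        while line != '' and line.strip() != 'END':
            following.append(line.strip())
            line = data_file.readline()

        profiles[username] = {
            'name': name,
            'location': location,
            'web': web,
            'bio': '\n'.join(bio_lines),
            'following': following,
        }
        username = data_file.readline().strip()
    return profiles


# The first profile is the sample above; the second is made up and has
# blank name, location and website lines, no bio, and follows no one.
SAMPLE = """tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
katieH



ENDBIO
END
"""

profiles = read_profiles(io.StringIO(SAMPLE))
print(profiles['tomCruise']['following'])
```

Reading from an already-open file object rather than a filename keeps the function easy to test with io.StringIO, as shown here. The `line != ''` checks only guard against a truncated file; a well-formed data file never triggers them.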
Cycles in the data
Although a user cannot be on their own following or followers lists, there can be "loops" (we call them
"cycles") such as this: user A can be following B who is following A. This is the shortest possible cycle.
Of course, cycles can be longer.
The Query File
Note that the word "query" just means "question". In computer science, we use it to mean a request
for information. For this assignment, a query will be provided in a file. Below we will review the
high-level parts of the query, look at an example, and then describe the format of the query file.
Overview
A query has three components: a search specification, a filter specification, and a presentation
specification.
The search specification describes how to generate a list of Twitter usernames, starting with an initial
username (a list of length one) and then finding their followers or people they are following, then
people that are those people's followers or who they are following, and so on. When processing the
search specification, don't try to do anything to avoid cycles. For instance, if the search specification
says to find the people who user A is following, and from there the people they are following, you
could find yourself back at user A. Don't try to avoid that.
After processing the search specification, we have a list of Twitter usernames. Its length could be
zero. For example, if the initial username is 'adalovelace' and the search specification contains a
single 'followers' keyword, then the length of the list will be zero if 'adalovelace' has no followers.
The filter specification describes how to filter the list of usernames produced by the search
specification. The filtering can be based on
whether or not they are following a particular user,
whether or not a particular user is their follower,
whether their name contains a particular string (case-insensitive), or
whether their location contains a particular string (case-insensitive).
After processing the filter specification, we have a possibly reduced list of usernames.
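The four kinds of filter can be sketched as a single test applied to each username in turn. This is only an illustration under stated assumptions: the function names, the profile dictionary layout, and the (keyword, value) tuple representation are hypothetical, and the keywords 'follower' and 'name-includes' are guesses modelled on the 'following' and 'location-includes' keywords that appear in the example query. The data below is invented.

```python
from typing import Dict, List, Tuple


def keep_user(data: Dict[str, dict], username: str,
              keyword: str, value: str) -> bool:
    """Return True if username passes the filter (keyword, value).

    The keyword spellings 'follower' and 'name-includes' are assumptions.
    """
    profile = data[username]
    if keyword == 'following':
        # Keep users who are following the user named value.
        return value in profile['following']
    if keyword == 'follower':
        # Keep users who have value as a follower, i.e. value follows them.
        return value in data and username in data[value]['following']
    if keyword == 'name-includes':
        return value.lower() in profile['name'].lower()
    if keyword == 'location-includes':
        return value.lower() in profile['location'].lower()
    raise ValueError('unknown filter keyword: ' + keyword)


def apply_filters(data: Dict[str, dict], usernames: List[str],
                  filters: List[Tuple[str, str]]) -> List[str]:
    """Return the usernames that pass every filter, in their original order."""
    result = usernames
    for keyword, value in filters:
        result = [u for u in result if keep_user(data, u, keyword, value)]
    return result


# Made-up data for illustration only.
data = {
    'tomCruise': {'name': 'Tom Cruise', 'location': 'Los Angeles, CA',
                  'following': ['c']},
    'c': {'name': 'Cee', 'location': 'Toronto, ON',
          'following': ['tomCruise']},
    'k': {'name': 'Kay', 'location': 'Ottawa, ON', 'following': []},
}

print(apply_filters(data, ['tomCruise', 'c', 'k'],
                    [('following', 'c'), ('location-includes', 'ca')]))
```

Applying the filters one after another means a user is kept only if they satisfy all of them, and the list can only shrink or stay the same.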
Once the search results have been found and filtered, the presentation specification describes how
the output should be presented. It specifies on what basis the results should be sorted, and whether
the results should be presented in a short or long format.
Example query
Here is an example query:
SEARCH
tomCruise
following
following
following
FILTER
following c
location-includes CA
PRESENT
sort-by popularity
format long
The search specification in this particular query has four steps.
Start with a list containing the username to start the search from; i.e., ['tomCruise']. Let's call that list
L1.
The search keyword 'following' says to replace each username p in L1 with the usernames of the
users who p is following. This yields a new list, L2.
For the next 'following' keyword, we start with L2 and repeat the same operation as in the previous
step, yielding another list, L3.
For the final 'following' keyword, we start with L3 and repeat that operation one last time, yielding list
L4.
Notice that each step yields a list of zero or more usernames that is the input to the next step. There
should be no duplicates in the final results list. Duplicates should be removed after each step.
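The step-by-step process above can be sketched as a loop over the search keywords. This is a rough illustration, not the required design: the function names, the dictionary layout, and the small three-user network are all hypothetical, and the order of usernames within each list is simply the order in which they are found, which the description above does not prescribe.

```python
from typing import Dict, List


def followers_of(data: Dict[str, dict], username: str) -> List[str]:
    """Return the users whose following list contains username."""
    return [u for u in data if username in data[u]['following']]


def apply_search(data: Dict[str, dict], start: str,
                 operations: List[str]) -> List[str]:
    """Return the usernames reached from start by applying each operation.

    Each operation is 'following' or 'followers'. Duplicates are removed
    after every step, and no attempt is made to avoid cycles.
    """
    current = [start]
    for operation in operations:
        next_users = []
        for p in current:
            if operation == 'following':
                found = data[p]['following']
            else:
                found = followers_of(data, p)
            for u in found:
                # Keep only the first occurrence of each username.
                if u not in next_users:
                    next_users.append(u)
        current = next_users
    return current


# Made-up network: a and b follow each other, which forms a cycle.
data = {
    'a': {'following': ['b', 'c']},
    'b': {'following': ['a', 'c']},
    'c': {'following': ['a']},
}

print(apply_search(data, 'a', ['following']))
print(apply_search(data, 'a', ['following', 'following']))
```

In the second call the search returns to 'a' because of the cycle, and 'a' appears only once even though both 'b' and 'c' are following it.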
The Twitter data file diagram_data.txt (see starter code) contains the follower/following relationships
as represented by this diagram. For those relationships, the search specification above would yield
this list of usernames: ['i', 'j', 'h', 'k', 'tomCruise']. Make sure that you can see how the four lists, ending
with this final one, are generated. Notice that the final list contains the users you can get to in three
"steps" of the "following" relationship, starting from 'tomCruise'.
The final list generated by the search specification becomes the input to the filter specification. For our
current example, the filter specification says that the list should be filtered in this way: a user should
be kept only if they are following user 'c' and have a location that includes the string 'CA'. Notice that
the resulting list of usernames is just ['tomCruise'].
The presentation specification says to present the results in long format and to order the users
according to their popularity.