Important!
As outlined in the course syllabus this homework is worth 6% of your nal grade. The maximum attainable
mark on this homework is 120. As was also outlined in the syllabus, there is a zero tolerance policy for
any form. of academic misconduct. This is assignment may be done individually or in pairs, as outlined in
the syllabus. Include a cover page, as described in the syllabus, with your assignment.
By electronically uploading this assignment to Blackboard you acknowledge these statements and accept
any repercussions if in any violation of ANY Purdue Academic Misconduct policies. You must upload your
homework on time for it to be graded. No late assignments will be accepted.
1. As discussed in lecture Information Theory is critical to understanding and developing communication
systems. One particularly important application is the e cient encoding of information, for instance
to minimize the total number of bits required to send an electronic message or save a le on a hard
drive. The mapping of symbols from your original message (i.e., in alpha-numeric) to symbols in a target
alphabet (e.g., binary) is called an encoding, and a variety of algorithms exist. You will learn one such
encoding, de ned by the Shannon-Fano algorithm, which is used in some .zip le algorithms. Before
doing so, some basic (Shannon) information theory concepts will need to be introduced.
information: represents the certainty about the probability of the outcome of an event. The event
itself will have several possible outcomes that can be modeled by a random variable X. Each of the n
possible outcomes xi=1;:::;n2X has a probability P(X = xi) = pi of occurring.
information content: can be viewed as how much useful information is contained in xi2X. Shannon
derived the measure I(xi) = logb
where b corresponds the basis units in which information is
measured (e.g., b = 2 means information is measured in bits).
entropy: is essentially the expected amount of information from an event, and can be calculated as
encoding: a mapping from one symbol set to another. For instance, standard ASCII encoding is used
to represent characters in a computer by assigning each one an 8-bit pattern, i.e., A=01000001.
compression ratio: as mentioned above, information can be represented using di erent coding schemes,
and it is possible to compare between them. One approach is to calculate the compression ratio as
uncompressed size
compressed size . Compressing a 10MB le to 1MB will have a compression ratio of 10:1. Compression isimportant as it allows for better use of resources (hard drive space, time to save or transmit les, etc.).
space saving: is the relative reduction in overall size of information encoded with di erent methods,
calculated as 1 1compression ratio.
Now we can outline the Shannon-Fano divide-and-conquer algorithm to create an encoding (represented
as a tree) for message M composed of symbols from alphabet X:
1. Sort symbols of X according to their decreasing probability of occurrence in M.
2. Recursively divide X into two parts, where each part has approximately the same probability mass.
The left branch is given a 0, the right a 1.
Wikipedia has a nice example en.wikipedia.org/wiki/Shannon-Fano coding for how this process works.
(a) (1 point) Create a message M by concatenating your rst name(s), last name(s), favorite food(s)
and your city(ies) of birth. Include a space between each. Assignment partners must use
BOTH members information.
(b) (2 points) Encode S using the ASCII binary encoding given in www.eso.org/ ndelmott/ascii.html.
(c) (4 points) Using the Shannon-Fano algorithm create an encoding tree for M. Clearly show all of
your work (this can be done by hand and scanned - as long as it is legible!)
(d) (5 points) Create an R script. to compute, for each symbol xi (letter, space or other characters) in
X, the frequency, P(X = xi), I(xi), its Shannon-Fano binary code, and the total number of bits
required to encode each xi. Only the binary code can be hard-coded (from part (c)).
This should require no more than 5 lines of R code.
(e) (2 points) Use R to compute the compression ratio and space saving using Shannon-Fano versus
ASCII encodings of M.
This should require no more than 2 lines of R code.
2. This question will provide some insight into network analysis using a subset of the Internet infrastructure
from July 2006. You will use the R igraph package for this question. Aside from the included R
documentation a good additional web reference/tutorial is kateto.net/networks-r-igraph.
(a) (2 points) Read in the edgelist provided on Blackboard and plot the undirected network such that
there are no vertex labels (vertex.label=NA), and each vertex is of size 1 (vertex.size=1). Use
the plot layout layout.drl.
This should require no more than 2 lines of R code.
(b) (4 points) Plot the cumulative degree distribution in a log-log scale plot such that the x-axis is
ordered in increasing node degree (the number of edges connected to the node).
This should require no more than 2 lines of R code.
(c) One important societal concern, especially for critical infrastructure such as the Internet, is the
ability to sustain functionality due to random node failure or targeted node attacks. In both of
these cases the impacted nodes are essentially deleted from the network.
Assuming that network functionality is related to the size of the largest connected component (the
largest set of nodes connected by edges such that a sequence of edges exist that connect any two
nodes in the subgraph) of the residual network (i.e., remaining network after deleting nodes that
failed/were attacked). The set of nodes to be removed will be denoted as R, and the following
questions will examine di erent strategies for constructing it.
i. (2 points) Use the clusters function to con rm that the original network has one connected
component.
This should require no more than 1 line of R code.
ii. (10 points) Create a function remove.nodes(sizes, g, method=c("random","degree")) that
takes a vector sizes containing the number of nodes to remove, the original network g, and the
method to use when selecting which nodes to delete. If method="random", R is to mimic random
failures in the network. That is, R is constructed by randomly selecting an appropriately sized
subset of network nodes (see sample.int). If method="degree" then R is more strategically
selected, mimicking a targeted attack that greedily selects nodes based on their degree. The
function will return a vector (of same length as sizes) that indicates the associated largest
connected component after node deletions.
The function should require no more than 8 lines of R code. HINT: You will probably
need to use lapply a few times, and also the unlist function may be useful.
iii. (6 points) Consider the cardinality of R to be 1%, 5%, 10%, 15% and 20% of the number of
nodes in the original network. Create a single plot to show the change in largest connected
subgraph cardinality (number of nodes) with respect to di erent cardinalities of R when con-
sidering both random and degree-based attacks. The x-axis should be the number of nodes
removed and the y-axis the cardinality of the largest component in the residual graph, respec-
tively. Each point should be color corresponding to the method used (red for random, orange
for greedy). Each point should be labeled with text indicating the cardinality of the largest
component. Use a plot character pch=19 and scale cex=1.5.
This should require no more than 7 lines of R code.
iv. (4 points) Use the plot from part (b) to explain why the results from (iii) make sense.
IE 332 Homework #2 Page 2 of 4
3. Consider the following hockey league context:
Every team has a unique name and jersey color.
Every game is played in a rink, which is located in a particular city and has a particular name.
Rink names may change over time.
Each team plays in exactly one home rink, but a rink may be home to zero or more teams.
A hockey team has at least one player, and a player can only play for one team at a time.
Every player has a unique jersey number and will play one position fromfright wing (RW), center
(C), left wing (LW), defense (D), goalie (G)g.
Coaches have a certain preferred strategy of fdefensive, o ensive, balancedg.
Coaches and players are all considered employees of a team. Every employee has a name, salary,
contract start and end date, and date of birth.
Each team has exactly one coach who cannot work for another team.
(a) (10 points) Consider the following ER Diagram in Chen Notation. Identify any errors and draw a
corrected diagram in Chen Notation.
(b) (8 points) Draw a Crow’s Foot notation diagram of your corrected solution from above.
(c) (8 points) Provide the CREATE TABLE statements.
(d) (8 points) Provide the INSERT statements to create two teams with two players each such that none
of the above logical constraints are violated.
(e) (3 points) Assume that a player is traded from one team to another. Provide the SQL syntax to
update the database accordingly.
IE 332 Homework #2 Page 3 of 4
4. Consider a review website with the following schema, where underlined values indicate the primary key.
Provide SELECT statements corresponding to the questions below.
Businesses(ID:int, Name:string, City:string, state:varchar(2))
* Each business has a name, unique ID, and is located in a speci c city and state.
Users(ID:int, Name:string, Followers:int)
* Each users has a name, unique ID, and a number of followers.
Reviews(bID:int, uID:int, Stars:int, Date:date, useful:int, funny:int, cool:int)
* Each user gives a review to a business by assigning a number of stars to it. The review is made on a
speci c date and a number of others may have found it useful, funny or cool.
BusinessTypes(bID:int, Type:string)
* Each business is of a certain type (e.g., restaurant, construction, dinner).
SUGGESTION: It could be useful, but not necessary, to create the four tables in MySQL and populate
them with data to test your queries on. Be careful to create useful data that isn’t misleading since testing
alone does not guarantee correctness (evidence 6= proof!).
(a) (2 points) How many distinct business types are there?
(b) (2 points) How many businesses are in Indiana (IN)?
(c) (3 points) How many more or less businesses does Indiana (IN) have compared to Ohio (OH) and
Alaska (AK) combined?
(d) (8 points) Which users have visited more than 10 distinct states? The output should be ordered
(decreasing) by the number of states users visited, where any ties are broken by the user ID
(increasing order). Each outputted user should show the ID, name and number of visited states.
(e) (6 points) How many businesses in Indiana have the word \corn" in their name, but are not identi-
ed as a \food" business? Your result should be ordered by business ID and give the ID and name
only.
(f) (10 points) What are the top 5 businesses in West Lafayette, Indiana for dinner (i.e., business type
is \dinner")? For this question you will use the average star rating as a measure of quality, where
higher number of stars is better. Sort the results by their average rating, breaking any ties by the
number of reviews received (decreasing order), then by business ID (increasing order). The output
should contain, for each business its ID, name, average number of stars and total number of reviews.
(g) (10 points) Which Indiana businesses have more than 20 reviews, and all of them are 5 stars?
Order the output by the number of reviews received (decreasing order), and business ID (increasing
order). Each business displayed should show its ID, name and number of reviews.