CSE 100 // PA4
Six degrees of Kevin Bacon
Assignment Overview
Starter code: download here Databases updated on 5/30
IMPORTANT Starter code for this PA is very limited. You’re in charge of designing your own
classes. Much like in PA3 we’ll simply do make and black-box test your code.
Checkpoint Pathfinder
You will write a program to play (and win) the generalized Kevin Bacon trivia game. Your program
will take as input ANY two actors/actresses and find the shortest sequence of shared movies
between them. Part of this program (unweighted shortest path) is due at the checkpoint deadline.
Final submission Pathfinder (Complete) + Actor Connections
You will complete the weighted shortest path version of Pathfinder. You will also write a program
that answers the query: “Given two actors X and Y, after which year did they first become connected?”
Extra credit Social network properties and PDF write-up
Compute a set of interesting social network properties on movie_casts.tsv. You’ll be required to write
up what you did, as well as turn in your code.
Starter Code files
1. Makefile Which you will have to modify as you add source code
2.ActorGraph.h/cpp Contains starter code to read movie_casts.tsv.
Datasets, test input files and reference solution
1. movie_casts.tsv The database that contains the majority of actors/ actresses found in IMDb. See
the notes below for more details.
2.test_pairs.tsv Text file containing the pairs of actors to find paths/connections. (Explained in detail
later.)
Program testing can be used to show the presence of bugs, but never to show their absence!
Edsger W. Dijkstra
3.out_paths_unweighted.tsv Output file generated by Pathfinder that stores the results from
finding the unweighted shortest path between two actors in test_pairs.tsv
4.out_paths_weighted.tsv Output file generated by Pathfinder that stores the results from finding
the weighted shortest path between two actors in test_pairs.tsv
5.refpathfinder Solution executable that implements Pathfinder
6.refactorconnections Solution executable that implements ActorConnections
7.out_connections.tsv Output file generated by ActorConnections on test_pairs.tsv
8.pair.tsv Text file that contains 100 pairs of actors you can use for testing.
IMPORTANT notes regarding movie_casts.tsv
We have provided you a tab-separated file movie_casts.tsv that contains the majority of actors and
actresses found in IMDb and the movies in which they have played. Specifically, the file looks like this
( denotes a single tab character, i.e. ‘\t’):
Actor/ActressMovieYear
50 CENTBEEF2003
50 CENTBEFORE I SELF DESTRUCT2009
50 CENTTHE MC: WHY WE DO IT2005
50 CENTCAUGHT IN THE CROSSFIRE2010
50 CENTTHE FROZEN GROUND2013
...
A.The first column contains the name of the actor/actress, the second column contains the name of a
movie they played in, and the last column contains the year the movie was made.
B.Each line defines a single actor→movie relationship in this manner (except for the first line, which is
the header). You may assume that actor→movie relationships will be grouped by actor name, but
do not assume they will be sorted.
C.èè Note that multiple movies made in different years can have the same name, so use
movie year as well as title when checking if two are the same. èè
D.Some actors have a "(I)" appended to their name - so "Kevin Bacon" is really "Kevin Bacon
(I)". Make sure you DO NOT format the names of actors or movies beyond what is given in the
tab-separated input file. In other words, each actor's name should be taken exactly as the actor's
name appears in the movie_casts.tsv file. During grading, the actor's name in the test file will
match the actor's name in the movie_casts.tsv file.
Note The movie_casts.tsv file is pretty big, so DO NOT UPLOAD to Vocareum. A copy already
exists there and will be used when running the grading scripts.
refpathfinder and refactorconnections are reference solutions you may use to check the
expected functionality in various tests.
IMPORTANT NOTES
■ DO NOT HARDCODE ANY TEST CASE. You’ll get a 0 (zero) on this PA if you do.
■ DO NOT USE ANY OTHER executable names than the ones in your Makefile. Refer to the
submission instructions for each part for the required name of executables.
CHECKPOINT [15 points]
1 Graph design and implementation
Design and implement your own classes to support a graph
implementation for the actor-movie relationships needed in this PA.
In order to complete the rest of the assignment, you must first design your graph structure and implement
the necessary classes. In your graph, each actor/actress will define a single node. Two nodes (i.e., actors)
will be connected by an undirected edge if the corresponding actors played in the same movie. Multiple
undirected edges can exist between the same two nodes (which would imply that the two actors played in
multiple movies together).
Implementation Checklist
❏Review ActorGraph.cpp This contains starter code to read the movie_casts.tsv file (the
code open a file and parses the actor/movie/year from each line). For the implementations below, you
may have to create separate .cpp files for your different classes based on your design.
❏Design/Implement your node objects What information does your node need to contain?
❏Design/Implement your "edges" How will you connect actors (nodes), relationships (edges),
and movies to each other so as to allow efficient traversal of the graph without needlessly copying
whole objects around? Do you want to have a data structure for edges or merely represent them as
connections between two nodes? Pointers and/or vector indices might come in handy…
❏Test your graph implementation Load the movie_casts.tsv file, you should expect to find
actors or nodes, movies, and directed edges. Note: if we implement our1, 941 7 4, 521 2 , 16, 1240 4
graph with directed edges, every undirected edge will be represented by two directed edges.
Do NOT use any pre-built graph data structures, like the Boost Graph Library (BGL). Limit yourself to using
the data structures available in the C++ STL data structures.
2 Pathfinder: unweighted shortest path
Write a program called pathfinder (in pathfinder.cpp) to find the shortest
path from one actor to another actor through shared movies.
Implementation Checklist
❏Implement pathfinder to work on an unweighted graph
Read the notes that follow for more information.
./pathfinder sample usage
Your program should be called like this (see detailed explanation of arguments below):
> ./pathfinder movie_casts.tsv u test_pairs.tsv out_paths_unweighted.tsv .
where test_pairs.tsv contains:
Actor1/Actress1 Actor2/Actress2
BACON, KEVIN (I)HOUNSOU, DJIMON
BACON, KEVIN (I)KIDMAN, NICOLE
BACON, KEVIN (I)WILLIS, BRUCE
BACON, KEVIN (I)GIAMATTI, PAUL
HOUNSOU, DJIMON50 CENT
and your program produces an output file out_paths_unweighted.tsv containing the following
(although the particular movies may not match, the total path lengths should match your output):
(actor)--[movie#@year](actor)--...
(BACON, KEVIN (I))--[ELEPHANT WHITE#@2011](HOUNSOU, DJIMON) (BACON, KEVIN (I))--[SUPER#@2010](MCKAY, COLE S.)--[FAR AND AWAY#@1992](KIDMAN, NICOLE)
(BACON, KEVIN (I))--[SUPER#@2010](MORENO, DARCEL WHITE)--[LAY THE FAVORITE#@2012](WILLIS, BRUCE) (BACON, KEVIN (I))--[A FEW GOOD MEN#@1992](MOORE, DEMI)--[DECONSTRUCTING HARRY#@1997](GIAMATTI, PAUL)
(HOUNSOU, DJIMON)--[IN AMERICA#@2002](MARTINEZ, ADRIAN (I))--[MORNING GLORY#@2010](50 CENT)
man pathfinder
./pathfinder will take 4 (four) command-line arguments in this order:
1.Name of text file containing the tab-delimited movie casts (e.g. movie_casts.tsv).
2.Lower-case character u or w
u builds the graph with unweighted edges
w builds the graph with weighted edges
(you do not need to implement for checkpoint, but you’ll implement it for final submission)
3.Name of text file containing the pairs of actors to find paths First line in the file is a
header, and each row contains the names of the two actors separated by a single tab character
4.Name of output text file Pathfinder will create a new file to store the results from finding the
shortest path between two actors. (Continued on next page.)
First line of the file is a header, and each row contains the paths for the corresponding pair of actors
and input pairs file (in the same order). Each path will be formatted exactly as follows:
()--[#@]()--[#@]...
....etc where the movie listed between each pair of actors is one where they both had a role.
(Refer to the example above.)
Grading Notes
As per previous PAs, your code will be autograded on submission.
1. For the checkpoint, you are only required to have the unweighted portion of pathfinder working i.e.
we will test your implementation with all 4 arguments except the second which we will always insert
as a u
2.The specific path your pathfinder program outputs may be different than the reference solution’s. As
long as the total path lengths are the same, then you are fine.
3.Complete pathfinder is due at final submission, so if you don’t get the “unweighted edges”
version working for the checkpoint, you must get it working by the final submission.
4.You may ONLY use C++ STL data structures and NOT the Boost Graph LIbrary (BGL).
5.We have provided small graph files along with the starter code. Once you get your code to work on
these small graphs, test them on larger graphs. 3
Submission instructions
Efficiency requirement All tests are executed with a 60s timeout. Your code must finish within 60s or you
won’t get any points for that particular test case.
1.
Output format requirement We gave very specific instructions on how to format your programs' output
because our auto-grader will parse your output in these formats, and any deviation from these exact formats
will cause the autograder to take off points. There will be no special attention given to submissions that do not
output results in the correct format. If you do not follow the exact formatting described here, you are at risk of
losing ALL the points for that portion of the assignment.
NO EXCEPTIONS!!!
All files must be turned in on Vocareum. Make sure to submit: 1.
Your updated Makefile. We should be able to compile your code by simply running make
pathfinder.
2.ALL .h/.cpp files you created. Upload the classes you designed and implemented for this part.
Do NOT submit any dataset. All required datasets for grading are already available on Vocareum and
will be passed correctly to your program.
FINAL SUBMISSION [85 points]
1 Pathfinder: weighted shortest path [25 points]
Complete your pathfinder program by implementing the weighted
edges version of your program. Edges will have a weight equal to the
age of the movie since 2015.
The reason for assigned edge-weights based on the age of the movie is because we will want to choose
newer movies over older movies when connecting two actors.
Implementation Checklist
❏ Make pathfinder work for weighted graphs (see below for details).
Formula for edge weights
If we are defining an edge between two actors that played in a movie made in year then the weight of,Y
that edge will be:
eight 1 (2015 ) w = + −Y
Note that we are using 2015 instead of 2018, which is because the dataset only contains movies released
in 2015 and earlier. Don't accidentally use 2018!
Example
Running the following command:
> ./pathfinder movie_casts.tsv w test_pairs.tsv out_paths_weighted.tsv
should produce an output file out_paths_weighted.tsv containing the following (although the
particular movies may not match, the total path weights should match your output):
(actor)--[movie#@year](actor)--...
(BACON, KEVIN (I))--[ELEPHANT WHITE#@2011](HOUNSOU, DJIMON)
(BACON, KEVIN (I))--[R.I.P.D.#@2013](HUSS, TOBY (I))--[LITTLE BOY#@2015](CHAPLIN,
BEN)--[CINDERELLA#@2015](MARTIN, BARRIE (II))--[PADDINGTON#@2014](KIDMAN, NICOLE)
(BACON, KEVIN (I))--[R.I.P.D.#@2013](BELTRAN, JONNY)--[THE WEDDING RINGER#@2015](ROGERS,
MIMI (I))--[CAPTIVE#@2015](WILLIS, BRUCE)
(BACON, KEVIN (I))--[R.I.P.D.#@2013](HOWARD, ROSEMARY (II))--[THE AMAZING SPIDER-MAN
2#@2014](GIAMATTI, PAUL)
(HOUNSOU, DJIMON)--[THE VATICAN TAPES#@2015](SCOTT, DOUGRAY)--[TAKEN 3#@2014](HARVEY, DON (I))--[THE
PRINCE#@2014](50 CENT)
Note The specific path your pathfinder program outputs may be different than from reference solution.
As long as the total path weights are the same, then you are fine.
To efficiently implement Dijkstra's algorithm for shortest path in a weighted graph, you should make use of
a priority queue. You can implement your own, or use the STL C++ implementation. Note that it does not
support an update_priority operation, so you’ll have to find a way around that. Think about what happens
if you insert the same key twice into the heap, but with a lower priority. Which one gets popped first? When
you pop a key-priority pair, how do you know if it is valid/up-to-date or not?
Reference solution
We included a reference solution in the starter code. The usage for running the reference solution is
> ./refpathfinder movie_casts.tsv w test_pairs.tsv out_paths_weighted.tsv
Note The specific path your pathfinder program outputs may be different than from reference solution.
As long as the total path weights are the same, then you are fine.
2 Actor Connections [60 points]
Write a program called actorconnections to answer the query: "Given actors X and Y, after which year did they become connected?"
By connected, we mean that there exists a path between actors X and Y in the equivalent movie graph
(similar to that constructed in Part 1) with the exception that the movie graph under consideration only
includes movies that were made before and including a certain year.
Implementation Checklist
actorconnections will give the option to use two different approaches to solve the actor-connections
problem. You have to implement both options:
1.Using the graph structure By solving the widest path problem and returning the
corresponding year to the bandwidth between actors X and Y – if a path exists. (Aside: Can you see
why this is simply the widest path problem?)
Relevant aside As covered in Worksheet 8, the widest path problem can be solved using a simple
modification to Dijkstra’s algorithm. It would be a violation of the Don't Repeat Yourself (DRY) principle to copy
the code for Dijkstra from pathfinder and simply modify the update rule. Given that only the update rule
changes, all we need is a way to pass in a different strategy for the update rule. One way to accomplish that is
by using lambdas in C++. If you’re unfamiliar with lambdas, you can learn more about them here:
https://www.cprogramming.com/c++11/c++11-lambda-closures.html
2.Without using the graph By using the Disjoint Set ADT (a.k.a union find). The disjoint-set (i.e.,
"union-find") data structure allows you to keep track of all connected sets of actors without
maintaining the corresponding graph structure. You might still consider adding movies incrementally,
and if a movie creates a path between two actors that were not connected before, two disjoint sets
would be merged into a single set in your union-find data structure. You should be able to then query
your data structure about the connectivity of any specific actor pairs.
Test against refactorconnections (see (Partial) Reference solution below)
./actorconnections sample usage
Running the following command:
> ./actorconnections movie_casts.tsv test_pairs.tsv out_connections.tsv ufind
should run your code (using the union-find algorithm) to produce an output file out_connections.tsv
containing the following:
Actor1Actor2Year
BACON, KEVIN (I)HOUNSOU, DJIMON1992
BACON, KEVIN (I)KIDMAN, NICOLE1991
BACON, KEVIN (I)WILLIS, BRUCE1990
BACON, KEVIN (I)GIAMATTI, PAUL1992
HOUNSOU, DJIMON50 CENT2003
man actorconnections
actorconnections will take 4 (four) command-line arguments in this order:
1. Name of a text file containing the movie casts in the same format as movie_casts.tsv.
Again, this file is quite large (6.4M), so you should create smaller versions to test your implementation
as a first step. We have provided the smaller graphs for testing along with the starter code. Once you
get it working on these graphs, test it on larger graphs and movie_casts.tsv.
2.Name of a text file containing the names of actor pairs Actor pairs will appear
tab-separated on each line (same format as test_pairs.tsv).
3.Name of your output text file This file should contain in each line an actor pair followed by the
year (tab-separated) after which the corresponding actor pair became connected (you will do all actor
pairs specified in the file from step 2, one on each line).
If two actors are never connected or if one or both of the actors is not in the movie cast file given to
you, simply append 9999 in the corresponding line of the output file. To further clarify, if the second
argument was a file containing the actor pair "BLANCHETT, CATE" and "REEVES, KEANU" and they
only became connected after adding a movie made in , your program should output the actor9971
pair and in their line of the output file. If they never became connected even after adding all the9971
movies from in the movie cast file to your graph, you should output on that line.9999
4.Lower-case string widestp or ufind. This option determines which algorithm will be used in
your program. If the fourth argument is not given, by default your algorithm should use the union-find
data structure (i.e., the equivalent of specifying ufind as the fourth argument). We will test your code
with both flags.
Edge case We will not test you on the corner case where both the actors are the same. You can handle
it however you want.
(Partial) Reference solution
We included a (partial) reference solution. The reference solution doesn’t take a fourth argument – it will
always use the same strategy. You should check the output of your code against the reference solution to
make sure everything is fine.
The usage for running the reference solution is
> ./refactorconnections movie_cast_file.tsv pair_file.tsv output_file.tsv
3
Submission instructions
Efficiency requirement For the final submission, each test is executed with a timeout based on its
complexity – varying from 10s to 50s. Your code must finish within the allotted time or you won’t get any points
for that particular test case.
Output format requirement We gave very specific instructions on how to format your programs' output
because our auto-grader will parse your output in these formats, and any deviation from these exact formats
will cause the autograder to take off points. There will be no special attention given to submissions that do not
output results in the correct format. If you do not follow the exact formatting described here, you are at risk of
losing ALL the points for that portion of the assignment.
NO EXCEPTIONS!!!
All files must be turned in on Vocareum. Make sure to submit:
1. Your updated Makefile. We should be able to compile your code by simply running make
where is one of pathfinder or actorconnections.
2.ALL .h/.cpp files you created. Upload the classes you designed and implemented for this part.
Do NOT submit any dataset. All required datasets for grading are already available on Vocareum and
will be passed correctly to your program.
EXTRA CREDIT [10 points]
Introduction
movie_casts.tsv is simply a social network – one limited to Hollywood connections. Social networks
are simply a graph of human relationships. In light of that, many graph properties become even more
interesting to be studies in the context of social networks. In the extra credit will take you on a tour of
graph properties that highlight interesting patterns when analyzed on a social network.
Before you start As you can possibly tell, extra credit got more challenging as the quarter went on. Extra
credit problems are difficult – because they are designed to be difficult. The point of having extra credit
problems is to encourage self-exploring and researching – it is challenging, but also very fun. Most tutors have
not done these extra credit problems before, so there may be limited support in the lab for help with extra credit
problems. Good luck!
Required deliverables
Turn in all deliverables on Vocareum.
1.extracredit.pdf A write-up with one section per problem, containing: 1) the answer to the
problem (this may be a value or a visualization) 2) If applicable, a clear description of your solution
and and analysis of the observed phenomenon.
2.Your updated Makefile. We should be able to compile your code by simply running make
extracredit. When we run ./extracredit movie_casts.tsv we should observe output
that is consistent with the answers on your report.
3.ALL .h/.cpp files you created. Upload the classes you designed and implemented for this part.
For this part make sure they are clearly documented.
Problem 1 Centrality [2 points]
Centrality. In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph. (source: Wikipedia)
This PA revolves around the meme of six-degrees of Kevin Bacon – an actor so prolific most actors are
said to be six hops away from him. Is that really so? For this part identify and report the ten most central
vertices in the network. On your report indicate the measure of centrality you adopted and explain it
briefly. Is Kevin Bacon truly deserving of this meme? Were you able to reveal other Kevin Bacons? Is there
an actress as central – or possibly more central – than Kevin Bacon?
Problem 2 Clustering coefficient [2 points]
Clustering coefficient. In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together.
(source: Wikipedia)
In particular, we want you to measure the global clustering coefficient as defined on Wikipedia:
For this problem compute and report the global clustering coefficient for the movie_casts.tsv
network. In your report make sure to indicate and explain the algorithm you used to compute the number
of closed triplets. Also, what does the clustering coefficient you found say about this network?
Problem 3 Degeneracy [2 points]
Degeneracy. In graph theory, a k-degenerate graph is an undirected graph in which every subgraph has a vertex of degree at most k: that is,
some vertex in the subgraph touches k or fewer of the subgraph's edges. The degeneracy of a graph is the smallest value of k for which it is
k-degenerate. (source: Wikipedia)
For this problem compute and report the degeneracy of the graph. In your report be sure to indicate the
algorithm you used to compute it and write a concise and clear explanation of how the algorithm works.
Problem 4 Interesting maximal cliques [4 points]
Clique. In the mathematical area of graph theory, a clique [...] is a subset of vertices of an undirected graph such that every two distinct vertices in the
clique are adjacent. [...] A maximal clique is a clique that cannot be extended by including one more adjacent vertex. (source: Wikipedia)
For this problem you must implement, report and explain an algorithm of your choice to find maximal
cliques in the movie_casts.tsv network. In doing so, sieve through the cliques to find an interesting
one that unearths an amusing pattern in the graph (e.g. is there a clique of Hollywood actors who
participated in underground poker games ran by an individual known as Molly?). Include in your report a
visualization of the clique (e.g. using something like graphviz) and write a few sentences about why it’s
interesting.