Python程序调试、辅导program编程设计、讲解Python编程语言辅导Database|调试Matlab程序

2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 1/8
HW3
PageRank
What is the most important website on the internet? Who is the "key player" on a sports team? Which countries
are the most central players in the world economy? There is no one correct answer to any of these questions,
but there is a most profitable one. PageRank (https://en.wikipedia.org/wiki/PageRank) is an algorithm for
ranking individual elements of complex systems, invited by Sergey Brin and Larry Page. It was the first and
most famous algorithm used by the Google Search engine, and it is fair to say that the internet as we know it
today would not exist without PageRank.
In this assignment, we will implement PageRank. There are many good ways to implement this algorithm, but in
this assignment we will use our newfound skills with object-oriented programming and iterators.
How it works
For the purposes of this example, let's assume that we are talking about webpages. PageRank works by
allowing a "random surfer" to move around webpages by following links. Each time the surfer lands on a page, it
then looks for all the links on that page. It then picks one at random and follows it, thereby arriving at the next
page, where the process repeats. Eventually, the surfer will visit all the pages one or more times. Pages that the
surfer visits more frequently have higher PageRank scores. Because the surfer moves between linked pages,
PageRank expresses an intuitive idea: important pages are linked to other important pages. This diagram
(https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg) from Wikipedia gives a nice
illustration. Note that more important webpages (higher PageRank) tend to be connected to other important
webpages.
A schematic for PageRank.
Data
You can complete this assignment using data from one of two sources.
Option 1: Hamilton
This data set comes from the hit Broadway musical "Hamilton."
The logo of the musical Hamilton, showing a
silhouette dressed in period custom standing on
top of a five-pointed star.
The Hamilton data set
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 2/8
The good folks at The Hamilton Project (http://hamilton.newtfire.org/) analyzed the script for us, obtaining data
on who talks about whom in each of the show's songs. When character A mentions character B, we'll think of
this as a link from A to B, suggesting that B might be important.
If you use this data set, listening to the soundtrack while working is strongly recommended.
Option 2: Global Airline Network
A map of the world, with many curved green
lines connecting different points on the map.
The points are airports, and the curved green
lines are flight routes.
The global airline network
Back in the Before Times, lots of people flew on airplanes. This data set includes a "link" from Airport A to
Airport B whenever there is a flight from B to A. This data set was collected by the OpenFlights Project
(https://openflights.org/data.html).
(A). Define Functions
In this part, all you have to do is hit shift + enter on the code block supplied below. This block defines two
functions. The first one retrieves the data from the internet and saves it to your local computer, while the second
reads in the data, producing a list of tuples. It's not important for you to be familiar with the code in these
functions right now -- we'll discuss them early in Week 4.
In [1]:
import urllib
import csv
def retrieve_data(url):
"""
Retrieve a file from the specified url and save it in a local file
called data.csv. The intended values of url are:

1. https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv
2. https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv
"""

# grab the data and parse it
filedata = urllib.request.urlopen(url)
to_write = filedata.read()

# write to file
with open("data.csv", "wb") as f:
f.write(to_write)

def read_data(path):
"""
read downloaded data from a .csv file, and return a list of tuples.
each tuple represents a link between states.
"""
with open(path, "r") as f:
reader = csv.reader(f)
return [(row[0], row[1]) for row in list(reader)]
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 3/8
(B). Grab the Data
The data live at the following URLs:
Hamilton: https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv
Airline: https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv
In each data set, each row corresponds to a "link" between objects. In Hamilton, the pairs have format
mentioner, mentioned while in the airline network the rows have format origin, destination.
Pick one of these data sets, and set the variable url appropriately by uncommenting one of the two lines
below. Then, call retrieve_data() and read_data() . The path argument for read_data() should be set to
"data.csv" . Create a variable called data to hold the return value of read_data() .
Your solution
In [ ]:
(C). Examine the structure of the data
This would also be a good time to inspect the data to make sure you understand how it is structured. Write a
function describe(n) that describes the meaning of the n th row of the data set you chose. In the Hamilton
data set, your function should do the following:
describe(5)
# output
"Element 5 of the Hamilton data set is ('burr', 'betsy'). This means that Burr mentions Be
tsy in a song."
In context of the airline flights data, your function should instead do this:
describe(5)
# output
"Element 5 of the flights data set is ('SIN', 'BKK'). This means that there is a flight fr
om BKK to SIN."
Please attend to capitalization and formatting. While the standard string concatenation operator + is
completely fine for this task, the fancy str.format() function may make your code somewhat simpler. This
page (https://realpython.com/python-formatted-output/) has some useful examples in case you'd like to try this.
Your Solution
# uncomment the second line if you'd prefer to
# work with the flights data.
url = "https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv"
# url = "https://philchodrow.github.io/PIC16A/homework/HW3-flights-data.csv"
# Call your functions below
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 4/8
In [ ]:
(D). Data to Dictionary
Write a function called data_to_dictionary that converts the data into a dictionary such that:
1. There is a single key for each character (in Hamilton) or airport (in flights).
2. The value corresponding to each key is a list of the characters/airports to which that key links. The list
should contain repeats if there are multiple links.
Here's an example of the desired behavior on a fake data set.
data = [("a", "b"),
("a", "b"),
("a", "c"),
("b", "c"),
("b", "a")]
data_to_dictionary(data)
# output
{"a" : ["b", "b", "c"], "b" : ["a", "c"]}
Your Solution
In [ ]:
(E). Define a PR_DiGraph class
A directed graph, or DiGraph, is just a set of arrows ("edges") between objects ("nodes"). It is a natural way to
represent data that represents one-way relationships, such as links from one webpage to another or mentions
of one character by another. We already saw a directed graph above when we introduced the idea of
PageRank. Here's a paired-down example.
Six circles, representing nodes, labeled A
through F. There are directed arrows between
certain pairs of nodes.
Example of a directed graph.
Implement a PR_DiGraph class with a custom __init__() method and a linked_by() method. The
__init__() method should accept two arguments: data and iteration_limit . The __init__() method
should then construct an instance variable self.link_dict which is simply the output of data_to_dictionary
applied to the argument data . __init__() should also construct an instance variable
self.iteration_limit , which simply takes the value of iteration_limit supplied to __init__() . Don't
worry about that one for now.
Then, define a method self.linked_by(x) which, when called, returns the value self.link_dict[x] .
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 5/8
Finally, add an __iter__ method, which returns an object of class PR_Iterator . We will define this class in
the next part.
Example session (using Hamilton):
D = PR_DiGraph(data, iteration_limit = 10000)
D.linked_by('peggy')
# output
['peggy', 'schuylerSis']
Your Solution
In [ ]:
(F). Implement PR_Iterator
Define a PR_Iterator class with a custom __next__() method.
The __init__ method of this class should create instance variables to store the PR_DiGraph object from
which it was constructed; a counter i , starting at 0, to log the number of steps taken, and a current_state
variable whose value is one of the keys of the link_dict of the Pr_DiGraph . You can choose its initial value
arbitrarily; in my solution code I chose self.current_state = "hamilton" .
We are going to use iteration to implement the PageRank algorithm. This means we are going to imagine a
surfer who is following the links in our data set. Implement the following two methods:
1. follow_link() .
A. Pick a random new character mentioned by the current character, or new airport served by the current
airport. Let's call this next_state .
B. If next_state != current_state , set current_state to next_state .
C. Otherwise (if next_state == current_state ), teleport (see below).
D. You might run into KeyError s, in which case you should again teleport (use a try-except block).
2. teleport() .
A. Set the current state to a new state (key of the link dict) completely at random.
Hint: use random.choice from the random module to choose elements of lists.
Finally, implement __next__() . __next__() should do follow_link() with 85% probability, and do
teleport() with 15% probability. You should also define a custom StopIteration condition to ensure that
only as many steps are taken as the iteration_limit supplied to the PR_DiGraph initializer.
1. To do something with 85% probability, use the following:
if random.random() < 0.85:
# do the thing
else:
# do the other thing
Example Usage
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 6/8
After you define your class, run the following code and show that it works. Note: your precise sequence may be
different from mine.
D = PR_DiGraph(data, iteration_limit = 5)
for char in D:
print(char)
following link : current state = burr
following link : current state = washington
following link : current state = burr
following link : current state = hamilton
teleporting : current state = washington
I have added printed messages here for you to more clearly see what should be happening, but it is not
necessary for you to do this. It is sufficient for your output to look like:
D = PR_DiGraph(data, iteration_limit = 5)
for char in D:
print(char)
burr
washington
burr
hamilton
washington
Your Solution
In [ ]:
In [ ]:
(G). Compute PageRank
Finally, we are ready to compute the PageRank in our data set. Initialize a PR_DiGraph with a large iteration
limit (say, 1,000,000). Use a for -loop to allow your surfer to randomly move through the data set. The number
of times that the surfer visits state x is the PageRank score of x .
Create a dict which logs how many times a given state appears when iterating through the PR_Digraph . So,
this dictionary holds the PageRank score of each state.
Your Solution
import random
# run the below
D = PR_DiGraph(data, iteration_limit = 5)
for char in D:
print(char)
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 7/8
In [ ]:
(H). Display Your Result
Use your favorite approach to show the results in sorted format, descending by PageRank score. The entries at
the top should be the entries with highest PageRank. What are the most important elements in the data set?
You may show either the complete list or just the top 10.
Check your code by comparing your top 10 to mine. Because we are using a randomized algorithm, your
results will not agree exactly with mine, but they should be relatively close. If your top 10 list is very different,
then you might want to revisit your previous solutions.
For Hamilton, my top 10 were:
[('hamilton', 166062),
('burr', 99180),
('washington', 92246),
('jefferson', 72450),
('eliza', 51485),
('angelica', 48042),
('madison', 37421),
('lafayette', 34297),
('lee', 33678),
('jAdams', 31121)]
For the flights data, my top 10 were:
[('LHR', 18043), # London Heathrow
('ATL', 16370), # Atlanta
('JFK', 14795), # New York JFK
('FRA', 14156), # Frankfurt
('CDG', 14073), # Charles de Gaulle (Paris)
('LAX', 13199), # Los Angeles
('ORD', 12915), # Chicago O'Hare
('PEK', 12525), # Beijing
('AMS', 12410), # Amsterdam Schiphol
('PVG', 11517)] # Shanghai
Your solution
In [ ]:
(I). Submit!
Check that your code is appropriately documented (comments and docstrings), and turn it in.
2021/1/24 HW3 - Jupyter Notebook
localhost:8892/notebooks/Downloads/HW3.ipynb 8/8