The PageRank Problem
Hans Vandierendonck
CSC3021 Concurrent Programming, 2018–’19
PageRank is Google’s algorithm to rank the search results that match the queried keywords [1]. The algorithm
models the internet as a directed graph where web pages are represented as vertices and links between pages are
edges. The PageRank algorithm calculates the likelihood, or rank, that a page will be visited by people surfing
the web. The rank of a page depends on the rank of the pages that link to it, where pages that are frequently
pointed to tend to gain a higher rank. Also, pages pointed to by highly ranked pages tend to receive a higher rank.
The algorithm is itself quite simple but in practice it captures well the appreciation of the importance of pages by
humans.
You will create a Java program that solves the PageRank problem in Assignment 1; and you will create a
concurrent version of that code in Assignment 2. This document provides essential background on the PageRank
problem. You are encouraged to initiate your own reading on this subject.
1 Graphs
Let’s talk first about graphs. A graph is a collection of items that are connected. We call the items vertices and
the connections are called edges or links. An edge connects two vertices. We assume edges are directed, i.e., they
point form. a source vertex (also called a tail) to a destination vertex (also called a head).
In mathematical terms, a graph is described as a pair G= (V;E) where V is the set of vertices and E is the set
of edges. We describe an edge as an ordered pair of vertices, i.e., E V V. The number of vertices is given by
jVjand the number of edges is given byjEj.
An edge is described as an ordered pair. As such, the pair (u;v) describes an edge pointing from u to v,
assuming u and v are elements of the set V. We say that u is the source of the edge and v is the destination. We
also say that the edge is incident to u and to v.
The number of edges incident to a vertex is called the degree of the vertex. A vertex with degree 5 will thus
appear in the description of 5 edges, either as source or destination. We further distinguish the in-degree and the
out-degree. The in-degree of vertex v is the number of edges pointing to v. It is indicated by deg (v). The out-
degree is the number of edges pointing away from v. It is indicated by deg+(v). There is a relationship between
the different types of degrees. In the absence of self-edges (edges pointing from a vertex to itself), it is always the
case that deg (v)+deg+(v) =deg(v).
A graph may be directed or undirected. In a directed graph, each edge is an ordered pair, pointing from a source
to a destination vertex. Our definition of a graph focusses on directed graphs. In an undirected graph, edges have
no direction. They link two vertices without notion of source and destination. We emulate undirected graphs by
assuming that the connection between two vertices u and v is recorded by a directed edge from u to v and another
directed edge pointing from v to u. As such, in our representation, the number of edges in an undirected graph is
always an even number.
2 Problem Statement
The PageRank algorithm assumes that a person surfing the web visits a sequence of web pages. The next page in
the sequence is determined either by following one of the outgoing links of the current page, or by restarting the
1
sequence in a random web page. The PageRank value PR(d) for a page d is recursively defined by the following
equation:
PR(d) = 1 n +d X
s2in(d)
PR(s)
deg+(s) (1)
where n is the number of vertices in the web graph, in(d) is the set of web pages pointing to the page d, deg+(s) is
the number of outgoing links of page s (the out-degree) and is the damping factor. This equation has two terms.
The first term states that with probability 1 the surfer visits a random web page. The second term models the
likelihood of arriving at page d by following a sequence of links. Note that page s has deg+(s) outgoing links
and that its PageRank value is equally distributed among all these links. As such, every link is equally likely to be
followed.
Because of its importance, efficient ways to solve this equation have been extensively investigated. We will
describe an iterative method that is easy to implement and is amenable to concurrent execution.
3 The Power Iteration Method
The power iteration method solves the recursive equation 1 by feeding in estimates for PR(s) in the right-hand-
side of the equation and calculating new pagerank values by evaluating the formula. The newly calculated values
are then fed into the right-hand-side again and the process is repeated until the pagerank values converge, i.e., they
change only marginally in subsequent steps.
In order to describe the power iteration method unambiguously, we introduce an extra parametertthat indicates
the iteration and let PR(s;t) represent the pagerank of page s in iteration t.
Initially, the page ranks are uniformly initialised to the same value. As they represent a probability, we make
sure that they all add up to 1 and we set them to
PR(s;0) = 1n (2)
Given that the PageRank values at iteration t are known, the power method calculates the PageRank values at
iteration t+1 using the following formula;
PR(d;t+1) = 1 n + X
s2in(d)
PR(s;t)
deg+(s) (3)
This formula is identical to Equation 1 with the iteration numbers added.
The power method continues for as many iterations as necessary until the L1-norm of the difference between
PageRank values in subsequent iterations is less than a pre-defined error bound :
X
p
jPR(p;t+1) PR(p;t)j
!
10 7. In other words, the error due to floating-point arithmetic may be larger than the desired accuracy
of the computation (typically 10 7), which means the algorithm would never converge.