PageRank algorithm

Andrew Thangaraj

Aug-Nov 2020

Recap

  • Vector space \(V\) over a scalar field \(F\)
    • \(F\): real field \(\mathbb{R}\) or complex field \(\mathbb{C}\) in this course
  • \(m\times n\) matrix A represents a linear map \(T:F^n\to F^m\)
    • dim null \(T\) + dim range \(T\) = dim \(V\)
  • Linear equation: \(Ax=b\)
    • Solution (if it exists): \(u+\) null\((A)\)
  • Four fundamental subspaces of a matrix
    • Column space, row space, null space, left null space
  • Eigenvalue \(\lambda\) and Eigenvector \(v\): \(Tv=\lambda v\)
    • Distinct eigenvalues have independent eigenvectors
    • Basis of eigenvectors results in a diagonal matrix for \(T\)
    • Every linear map has an upper triangular matrix representation

PageRank algorithm

  • Origin of Google

    • Before Google, world wide web pages had extensive directory sites

    • Search mostly used to struggle to give relevant results first

  • Larry Page and Sergey Brin

    • Developed PageRank as an algorithm to rank web pages based on their importance

    • Importance was determined by the links connecting the pages

    • Algorithm rooted in linear algebra, eigenvalues and eigenvectors

Webgraph

Construction

  • Each web page is a node

  • Edge from Node \(i\) to Node \(j\) if

    • web page of Node \(i\) has a link to web page of Node \(j\)
  • Entire www is a “huge” graph

Degrees

  • Out-degree of a node: number of outgoing links

  • In-degree of a node: number of nodes that backlink

  • High in-degree: could indicate “importance” or popularity

Importance of nodes

  • \(x_i\): real number that denotes importance of Node \(i\)

  • \(n_i\): out-degree of Node \(i\)

  • \(L_i\): set of nodes that link to Node \(i\)

Condition enforced on importance

\[x_i=\sum_{j\in L_i}\dfrac{x_j}{n_j}\]

Importance, eigenvalues and eigenvectors

\(A=\begin{bmatrix} 0&0&0&1/3&0\\ 1&0&0&1/3&0\\ 0&1/2&0&0&0\\ 0&1/2&1/2&0&1\\ 0&0&1/2&1/3&0 \end{bmatrix}\)

\(x=(x_1,x_2,x_3,x_4,x_5)\)

Condition on importance

\(x = Ax\)

Importance vector \(x\) is an eigenvector of \(A\) with eigenvalue 1

Existence of importance vector

\(A\): \((i,j)\)-th entry is \(\dfrac{1}{n_j}\)

  • Node \(j\) links to Node \(i\)

  • \(n_j\): out-degree of Node \(j\)

columns of \(A\) add to 1

or, rows of \(A^T\) add to 1

\(v=(1,1,\ldots,1)\) (all-1 vector) is an eigenvector of \(A^T\) with eigenvalue 1

\(\lambda=1\) is an eigenvalue of \(A\)

Importance vector: eigenvector of \(A\) corresponding to eigenvalue 1

Example

\(A=\begin{bmatrix} 0&0&0&1/3&0\\ 1&0&0&1/3&0\\ 0&1/2&0&0&0\\ 0&1/2&1/2&0&1\\ 0&0&1/2&1/3&0 \end{bmatrix}\)

\(x=(2,4,2,6,3)\)

Highest importance: Node 3

Real webgraphs

Dataset from Laboratory for Web Algorithmics

  • Sample webgraph: European union domains in 2015

    • Nodes: 1,070,557,254

    • Edges: 91,792,261,600

    • average degree: 85.743

    • maximum in-degree: 20,252,239

    • maximum out-degree: 35,340

Importance vector: needs to be efficiently computed