Ranking of nodes in networks


A graph is a structure composed of vertices and edges. Formally, consider a set of vertices $V$ with elements $j$ and a set of edges $E$ composed of ordered pairs $(i,j)$ of elements of $V$. The graph $G := (V,E)$ is determined by the sets $E$ and $V$. In the graph shown in Fig. 1 vertices are represented as circles and edges are represented as arrows. The set of vertices is $V=\{1,2,3,4,5,6\}$ and the set of edges is $E=\{(1,2),(1,5),(2,3),(2,5),(3,4), (3,6),(4,5),(4,6),(5,4)\}$. We will use $J$ to denote the number of vertices in the graph and, although not formally required, we will index vertices as $V=\{1,\ldots,J\}$.


Figure 1. Networks are naturally modeled as graphs. A broadly applicable definition of the importance of node $i$ is its degree of connectivity. However, it is not only a matter of counting the number of links to $i$, but also of pondering the importance of the nodes that point to $i$. A link by node 4 is more important than a link from node 1.

The interest in graphs is that they are natural models of networks. If we interpret vertices as persons and edges $(i,j)$ as representing the relation ``$i$ considers $j$ a friend, the graph is a model of a social network. If vertices denote employees and edges the relation ``$j$ is the boss of $i$, the graph represents a corporation's hierarchy. If we consider that vertices represent web pages and edges links from page $i$ to page $j$, the graph is model of the Web. Or, if vertices are associated with scientific papers and edges with the action ``paper $i$ cites paper $j$,'' the graph models the network of scientific knowledge.

An important problem in networks is to determine the importance of a node relative to other elements of the network. There are different ways of defining importance, but one that has broad applicability is to rank nodes according to their degree of connectivity. Think, e.g., of the network of scientific papers. It is quite likely that a paper that has received hundreds of citations is more important than a paper without any citations at all. In the graph in Fig. 1 this argument implies the reasonable conclusion that node 6, to whom 4 and 5 point, is more important than node 1 that lacks any incoming link.

An interesting conundrum arises in comparing vertices 2 and 3. While both nodes have only one incoming link, it seems reasonable to rank 3 higher than 2. Indeed, start noticing that 2 is clearly more important than 1 because while no vertex points to 1, 2 is at least pointed to by node 1. It then follows that it is reasonable to consider a link from node 2, which is received by 3, as more important than a link from node 1, which is received by 2, from where it follows that the rank of 3 is likely higher than the rank of 2. Similarly, compare nodes 4 and 5. While 5 is being pointed to by three nodes and 4 is pointed to by two nodes, it is not clear that 5 is in a better position than 4. As a matter of fact, it may well be that 4 is better positioned than 5 because the cumulative importance of the two nodes that point to her, namely 3 and 5 is larger than the cumulative importance of the three nodes that point to 5, namely 1, 2, and 4.

In summary, the rank of a given node, say $i$, measured in terms of degree of connectivity, is determined by the set of nodes that point to it. However, there are two properties of this set that determine the rank of $i$:

  • (P1) The number of elements in this set.
  • (P2) The rank of the members of this set.

The complementarity of properties (P1) and (P2) is most clear in a hierarchical network like the corporation's graph. At the same level of the hierarchy, it is better to have more direct subordinates -- Property (P1) -- but the higher in the hierarchy the better -- Property (P2). In terms of web pages, Property (P1) stresses the importance of having a large number of incoming links, while Property (P2) indicates that links from highly ranked pages are more relevant than links from obscure pages.

Accounting for Property (P2) is necessary but difficult because its definition is self-referential. To determine the rank a node it is necessary to know the rank of other nodes in the network, which in turn depend on the rank of the node in question. An elegant solution to overcome this problem is the use of a random walk in the graph that we describe next. The random walk model was originally proposed to rank scientific papers but its most popular application is the ranking of web pages by search engines.

Random walk in a graph


Consider a web surfer that visits a page and clicks on any of the page's links at random. She repeats this process forever. What fraction of his time will be spent on a given page? The answer to this question is the rank assigned to the page. The same idea can be used to rank the nodes in any of the networks described above. In abstract we refer to the random web surfer as performing a random walk in a graph. It is clear that the ranks obtained from the number of visits of the random walker satisfy properties (P1) and (P2). The time spent by the random walker at node $i$ increases when the number of nodes pointing to $i$ increases, as required by Property (P1), and when the rank of the nodes pointing to $i$ increases, consistent with Property (P2).


Figure 2. Random walk in a graph. Consider an agent that visits a node and proceeds to follow any of the node's links at random. She repeats this process forever. What fraction of his time will be spent on a given node? The answer to this question is the rank assigned to the page

For a formal problem definition of rank start defining $n(i)$ as the $i$-th node's neighborhood containing the indexes $j$ to which $i$ is pointing, i.e., $n(i):=\{i:(j,i)\in E\}$. Let $N_{i}$ be the number of nodes in neighborhood of $i$. Define finally $n^{-1}(i)$ as the set of nodes that point to $i$.

An outside agent approaches an arbitrary node $i$ at time $t=0$. From there, it jumps to one of the neighbors of $i$ at time $t=1$. The agent repeats this process forever. If the agent is visiting node $A(n)$ at time $t$ it will visit a node in the neighborhood $n(A(t))$ at time $t+1$. The fraction of time the agent spends visiting node $i$, is defined as the node's rank. To express this mathematically define the indicator function $I (A_{n}=i)$ with value 1 when the agent visits $i$ at time $t$ and 0 otherwise. The rank $r_{i}$ of node $i$ is then defined as

Since we are considering equal probabilities of jumping to any neighbor, the probability $P_{ij}$ of transitioning from node $i$ to node $j$ is

where, we recall, $N_{i}$ is the number of nodes in the neighborhood of $i$; see Fig. 2. We are now ready to build an algorithm to compute $r_{i}$. We start with a randomly chosen node $i$ and jump equiprobably to any of its neighboring nodes $j\in n(i)$. The probability of selecting any of this nodes is $P_{ij} = 1/N_{i}$. We repeat this process a large number of times $n$ and keep track of the number of visits to each node. The rank $r_{i}$ is then approximated as the ratio between the number of visits to node $i$ and the total time $n$, i.e.,

It is possible the agent reaches a dead-end node without any outgoing link. In such case, the next visit is to a different randomly selected node. This is equivalent to modifying the initial graph to include links from dead-end nodes to all other nodes in the graph.

This is certainly a possible algorithm, but we can obtain a faster version by exploiting the fact that these random visits can be modeled as a Markov chain (MC)

Markov chains


Notice that the probability of visiting node $j$ at time $n+1$, given that at time $n$ we are visiting node $j$ is $P_{ij}$ independently of how we reached $i$. This basic memoryless property implies that the stochastic process composed of node visits $A_{n}$ is a MC. Exploiting this observation it is possible to build a more efficient algorithm to compute ranks $r_{i}$.


Figure 3. Graph rank. Node rankings according to the random agent model are $r:=[0.04, 0.06, 0.07, 0.34, 0.26, 0.21]^{T}$. In the figure, the area of the circles is proportional to their rank. Notice how 4 has a higher rank than 5 despite having fewer incoming links. This is consistent with the fact that links to 5 come from lower ranked nodes.

Let $p_{i}(n):={\rm Pr}(A_{n}=i)$ denote the probability that the outside agent is at node $i$ at time $n$. It is possible to relate $p_{i}(n+1)$ with the probabilities at time $n$ of those nodes that can transition into $i$. Say that at time $n$, the agent is at some node $j\in n^{-1}(i)$. Then, with probability $P_{ji}$ (note the reversion of subindices) the agent transitions into $i$ at time $n+1$. Weighting this transition probability by the probability $p_{j}(n)$ and summing over all possible transitions we can write

where in the second equality we used that $P_{ji}=0$ for all nodes that do not link to $i$.

An interesting property of MCs is the existence of limit probabilities $\lim_{n\to\infty}p_{i}(n)$ under some conditions which we assume to hold. With further conditions it can be further proved that a property called ergodicity holds. Ergodicity implies that time average limits, as the one required to compute ranks $r_{i}$ coincide with the expected value of the event of interest. For ranks $r_{i}$ ergodicity implies that

We can then equate the problem of computing ranks $r_{i}$ to the computation of the limiting probabilities. An alternative algorithm to compute nodes' ranks is to run the probability updates in \eqref{eqn_prob_evolution_scalar} and approximate

for a sufficiently large $n$. This latter algorithm converges much faster than the one based on random visits. The outcome of applying this ranking algorithm to the network in Fig. 1 is shown in Fig. 3.