Classification of large graphs
Classification of large graphs is based on their size [math]\displaystyle{ n }[/math], where [math]\displaystyle{ n }[/math] is the number of vertices in the graph. Realistic graphs are typically extremely sparse, with roughly tens to at most hundreds of edges per node on average; thus, the number of nodes and the number of edges are both [math]\displaystyle{ O(n) }[/math].
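Because both the number of nodes and the number of edges scale linearly, the memory footprint of such a graph can be estimated directly from the average degree. The following back-of-the-envelope sketch (the function name, the degree of 30, and the 8-byte indices are illustrative assumptions, not figures from the text) does the arithmetic for a compressed sparse row (CSR) representation:

```python
# Rough memory estimate for a sparse graph stored in compressed sparse row
# (CSR) form. The average degree and index width are assumptions chosen only
# to illustrate the O(n) scaling of both vertices and edges.
def csr_memory_gb(n, avg_degree, index_bytes=8):
    m = n * avg_degree                 # number of stored arcs
    row_ptr = (n + 1) * index_bytes    # one offset per vertex
    col_idx = m * index_bytes          # one endpoint per stored arc
    return (row_ptr + col_idx) / 1e9

# 10M vertices at average degree 30: roughly 2.5 GB of index data.
print(csr_memory_gb(10_000_000, 30))
```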
Small graphs (under 10k vertices). At this size, standard algorithms run easily. For instance, computing all-pairs shortest paths takes [math]\displaystyle{ O(n^3) }[/math] time and [math]\displaystyle{ O(n^2) }[/math] memory. This is not a problem for any modern computer.
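As a concrete illustration of this regime, here is a minimal Floyd-Warshall sketch (the dense adjacency-matrix input format is an assumption made for illustration), in which the three nested loops give the [math]\displaystyle{ O(n^3) }[/math] time and the distance matrix gives the [math]\displaystyle{ O(n^2) }[/math] memory:

```python
# Minimal Floyd-Warshall sketch for all-pairs shortest paths: O(n^3) time and
# an O(n^2) distance matrix, both of which are harmless for graphs of this size.
def all_pairs_shortest_paths(weights):
    """weights[i][j]: edge weight, float('inf') if no edge, 0 on the diagonal."""
    n = len(weights)
    dist = [row[:] for row in weights]      # O(n^2) memory
    for k in range(n):                      # three nested loops: O(n^3) time
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist
```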
Large small graphs (10k – 1M vertices). These are graphs for which [math]\displaystyle{ O(n^2) }[/math]-time algorithms are feasible, but [math]\displaystyle{ O(n^2) }[/math]-memory algorithms become prohibitive or impossible. These graphs are still more closely associated with small graphs, because many tasks, such as exact diameter computation, can be carried out on them given some additional time. Two differences are worth noting: (1) the most important properties of graphs in this regime (and larger) are very different from the properties of small graphs, and (2) even when quadratic-time computations are possible, they can be challenging, and they become prohibitive if they are used in an exploratory data analysis mode or as part of a large cross-validation computation.
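For example, the exact diameter of an unweighted graph can be computed with a breadth-first search from every vertex, which on a sparse graph costs roughly [math]\displaystyle{ O(n^2) }[/math] time but only linear memory. A minimal sketch (the adjacency-list input and the assumption of a connected, undirected graph are for illustration only) follows:

```python
# Exact diameter via breadth-first search from every vertex: O(n * m) time,
# which is about O(n^2) on a sparse graph, but only O(n + m) memory.
from collections import deque

def exact_diameter(adj):
    """adj: list of neighbor lists for a connected, unweighted, undirected graph."""
    n = len(adj)
    diameter = 0
    for source in range(n):
        dist = [-1] * n                    # fresh distance array per BFS
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] == -1:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist))
    return diameter
```

The quadratic cost here is in time rather than space, which is exactly the trade-off that keeps such computations feasible in this regime.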
Small large graphs (1M – 100M vertices). In many ways the transition between small and large graphs occurs around one million vertices. For instance, with a graph of five million vertices, algorithms that perform [math]\displaystyle{ O(n^2) }[/math] computations are generally infeasible without specialized computing resources. That said, with appropriate attention to computational issues, graphs with between 1M and 100M vertices are reasonably easy to mine with fairly sophisticated techniques on modest computing resources. The basic reason is the extreme sparsity of real-world networks: graphs in this size regime typically have an average degree between 5 and 100, so even a large real-world graph has at most a few billion edges. This consumes a few gigabytes of memory and can easily be tackled on a modern laptop or desktop computer with 32 GB of memory. For instance, computing a PageRank vector on a graph with 60M vertices takes a few minutes on a modern laptop.
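A minimal power-iteration PageRank sketch over a SciPy sparse adjacency matrix (the damping factor, tolerance, and dangling-node handling shown here are common choices, not prescriptions from the text) shows why the linear memory footprint is the only real constraint at this scale:

```python
# Power-iteration PageRank on a sparse adjacency matrix. Memory use is
# dominated by the O(n) edge data, which is why graphs with tens of millions
# of vertices fit comfortably on a laptop.
import numpy as np
import scipy.sparse as sp

def pagerank(A, alpha=0.85, tol=1e-8, max_iter=100):
    """A: scipy.sparse matrix with A[i, j] = 1 for an edge i -> j."""
    n = A.shape[0]
    out_deg = np.asarray(A.sum(axis=1)).ravel()
    inv_deg = np.divide(1.0, out_deg, out=np.zeros(n), where=out_deg > 0)
    P = sp.diags(inv_deg) @ A                 # row-stochastic transition matrix
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        dangling = x[out_deg == 0].sum() / n  # spread dangling mass uniformly
        x_new = alpha * (P.T @ x) + alpha * dangling + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x_new
```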
Large graphs (100M – 10B vertices). With a few hundred million or even a few billion vertices, the complexity of running even simple graph algorithms increases. However, in this regime, even the largest public networks fit into main memory on large shared-memory systems with around 1 TB of main memory. This was the motivation for the Ligra project. Likewise, the effective diameter computations on the Facebook network, with roughly 70B edges, were done on a single shared-memory machine.
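A rough calculation with assumed byte sizes (not figures from the text) illustrates why an edge set of this size still fits in roughly 1 TB of shared memory:

```python
# Rough arithmetic with assumed sizes: ~70 billion edges stored in CSR form
# with 32-bit vertex ids (enough for up to about 4 billion vertices).
edges = 70e9
bytes_per_endpoint = 4
one_direction_tb = edges * bytes_per_endpoint / 1e12
print(one_direction_tb)   # ~0.28 TB; about twice that if both directions are stored
```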
LARGE graphs (over 10B vertices). With over 10 billion vertices, even shared-memory systems are unable to cope with the scale of the networks. The particular number defining this threshold will no doubt become outdated at some point, but there are, and will continue to be, networks sufficiently massive that shared-memory machines no longer suffice and specialized distributed techniques are required. This is the case, for instance, with the entire web graph. Note that while global mining tasks may require a distributed-memory computer, it is often possible to extract far smaller subsets of these LARGE graphs that are only large and can easily be handled on a shared-memory computer. The types of problems that we can expect to solve on LARGE graphs are limited to relatively simple computations such as triangle counting, connected components, PageRank, and label propagation.
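As an illustration of the kind of simple, vertex-local computation that survives at this scale, the following is a minimal single-machine sketch of connected components via label propagation (the adjacency-list input format is an assumption; distributed vertex-centric frameworks apply the same propagate-the-minimum-label pattern with the vertices partitioned across machines):

```python
# Connected components by iteratively propagating the minimum vertex label,
# a simple vertex-local computation of the kind that also scales to
# distributed settings.
def connected_components(adj):
    """adj: symmetric list of neighbor lists; returns one label per vertex,
    equal to the smallest vertex id in its component."""
    labels = list(range(len(adj)))
    changed = True
    while changed:                          # repeat until no label changes
        changed = False
        for u, neighbors in enumerate(adj):
            for v in neighbors:
                m = min(labels[u], labels[v])
                if m < labels[u] or m < labels[v]:
                    labels[u] = labels[v] = m
                    changed = True
    return labels
```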
References
- Bühlmann P., Drineas P., Kane M., van der Laan M. (eds.). Handbook of Big Data. CRC Press, 2016.