Wednesday, April 14, 2010

Thesis - Sharding the Neo4j Graph DB

My name's Alex, I'm currently a distributed systems masters student at KTH, Stockholm, and part way through my thesis at Neo Technology and SICS.
The project is a dual thesis that another student - Martin Neumann - and I are working on together.

What this project is
Graph databases are unique in that they're optimized to model highly connected data, and query it efficiently using graph traversal patterns. The benefits of this approach are numerous [4], but won't be discussed here. Our focus is on understanding the implications that highly connected data has when sharding (partitioning). We intend to investigate the requirements of and start developing an auto sharding framework for the Neo4j graph database. Although it's still early days, this post will (try to) summarize our progress so far and outline ideas for future work.

What this project isn't
There are problems we won't have time to tackle during the project.
These are assumed to be solvable and will be considered in design decisions, but won't be solved.

Replication: although essential to any fault tolerant distributed system, will not be implemented.

Graph partitioning algorithm design: 100's of graph partitioning algorithms exist already. We intend to do a broad sweep of the field then identify and evaluate those that appear most relevant, but designing them is not the goal. Graph partitioning algorithms are modern day black magic and their designers are experts in the dark arts. We'll leave the trickery to them and try to create a modular design that supports "pluggable" partitioning algorithms.

Approaching the problem from multiple angles...

Sharding at insert-time
Maybe no links exist between data (E.g. key-value stores), or the graph topology can be guesstimated ahead of time (E.g. GIS applications). In these cases data can be allocated to shards at insert time.
Your favorite DHT can already do this by sharding along the key space, but coupling sharding logic to a particular data structure is limiting.
I haven't looked into Twitter's Gizzard framework in detail but it seems to be powerful in this respect, allowing users to define custom "hash" functions.

Decoupled Components
As in Gizzard, it's desirable to abstract the insert time sharding logic into a pluggable component.
We define the Insert-Sharding component for this purpose.

Inputs: Insert-Sharding-Function , Record.
Outputs: Shard-Mapping (mapping between data and shards).

Sharding at run-time
Thanks to the world wide interweb many applications try to model dynamic, constantly evolving domains. Social networks, for example, likely change every time a user starts/ends a relationship, moves to another city, etc.
Insert-Sharding decisions are cheap and effective in many situations, but also have a shelf life.
Not only may the Insert-Sharding scheme need to be updated periodically, but as data and/or access patterns evolve the already sharded data may need to be reallocated to new shards.

Decoupled Components
As with Insert-Sharding, it's desirable to abstract the runtime sharding logic into pluggable components.

Sharding at runtime relies on accurate, current information to make intelligent decisions. User access patterns, shard sizes, links between data, traffic... these are examples of metrics that can be used as input parameters to a runtime sharding algorithm. Applications have differing requirements regarding metrics they log, and metrics they need for runtime sharding. We define a Runtime-Logging component that encapsulates the logging function.

Inputs: Runtime-Logging-Function.
Outputs: Runtime-Metrics.

Implementation of the runtime sharding algorithm is domain specific. To fulfil the goal of being pluggable we loosely define the Runtime-Sharding component.

Inputs: Runtime-Sharding-Function, Runtime-Metrics, Change-Log (recent CRUD operations).
Outputs: Shard-Mapping.

Note, graph partitioning algorithms may be a natural fit here, but are not essential. Refer to "Partitioning algorithms in a bit more detail..." for more on graph partitioning algorithms and their application to graph databases.

Reallocation of data to different shards may be beneficial, but if migration occurs during peak load performance will suffer and the process becomes counter productive. Shard-Mapping produced by the Runtime-Sharding component are instructions on where to migrate data, they say nothing about when to migrate.
For that purpose we need a module that is responsible for deciding when data migration should occur, and then issues commands to the shard servers to perform the migrations.
To perform these tasks we define the Migration-Scheduler component.
Inputs: Migration-Scheduler-Function, Shard-Mapping(, Runtime-Metrics).
Outputs: Migration-Commands.

A loosely coupled framework
This results in the first incarnation of our vision for a loosely coupled, extensible sharding framework.
Note, we've purposely emitted finer details. Partly for readability... and partly because we don't have all the details defined yet.

Partitioning algorithms in a bit more detail...

Graph partitioning algorithms
To massively over simplify, graph partitioning algorithms partition a graph into a number of disjoint subgraphs, while attempting to minimize edge cut, minimize the size difference between partitions (shards), minimize conductance of individual partitions, and/or maximize modularity of the partitioned graph.
The goal is to partition the graph such that traversal operations execute locally as much as possible, and are required to perform network hops (to different shards) infrequently.

Difficulties in partitioning graphs
Optimal graph partitioning is known to be NP-hard. O(n²) complexity is not uncommon, n can be >> 1,000,000,000, and with graphs of this size it becomes unrealistic to fit all state in RAM.
Each algorithm often performs well on a set of graph topologies, then is ineffective on others.
They often don't cope with dynamism (CRUD), require access to the entire graph, or are inherently sequential.
Moreover, many assume the graph has unlabeled edges, undirected edges, and/or unweighted vertices/edges. Unfortunately, every "un" results in some loss of the underlying graph semantics.

The point isn't that these algorithms are impractical, but limitations should be considered and accounted for. There is no one best algorithm, but in most cases a capable algorithm for a given application exists.

Importance of access patterns
When access patterns exist (and are recognizable) taking them into consideration is likely to be beneficial.
To over simplify again, take the - completely hypothetical - example of a social network where the most common operation is, "find my ex-lovers' current relationship status".
Social networks are graphs and this is a very simple graph traversal operation, making this potentially suitable for a graph partitioning algorithm. An algorithm that, during partitioning, places greater importance on not cutting the is_an_ex_lover edge type may do well in this case. Conversely, one that places greater importance on the is_an_old_school_colleague_that_I_never_speak_with may miss the point.
Algorithms we've looked at
As mentioned, we've tried to identify and evaluate a number of graph partitioning algorithms that appear most relevant.

To gauge relevance we first defined "desirable" characteristics of a graph partitioning algorithm, they are:
  • Dynamic: Algorithm can update an existing partitioning when CRUD operations are performed, without completely restarting.
  • Iterative: Rather than computing the partitioning in one operation, it continually improves over time.
  • Smoothness: When updating a partitioning, guarantees are given that the partitioning will not drastically change.
  • Distributed: Algorithm can be distributed across multiple machines, where each machine only stores and processes a subgraph of the global graph. Especially useful for massive graphs and for parallelizing the Runtime-Sharding process.
  • Local View: At no time does the algorithm need to have knowledge of the entire graph.
  • Weighted: Algorithm makes use of edge and/or vertex weights. This allows us to map certain metrics or graph properties to weights, to capture more of the graph's underlying semantics.
Using these characteristics we tracked down and began evaluation on the following algorithms:

DiDiC [1]:
"A Distributed Diffusive Heuristic for Clustering a Virtual P2P Supercomputer"
Dynamic, iterative, distributed, local view, smooth, weighted edges.
Reliant on the existence of clusters. May be slow to converge to a good partitioning, but it's iterative nature partly makes up for this.
Similar in nature to p2p gossip algorithms.

Evolving Set Process [2]:
"Finding sparse cuts locally using evolving sets"
Fast, but extremely reliant on the existence of clusters.
Local view, weighted edges.

Min-cut Tree [3]:
"Dynamic Graph Clustering Using Minimum-Cut Trees "
Dynamic, local view (mostly), smooth (for certain operations).

Non-graph partitioning algorithms
Not all partitioning algorithms are graph partitioning algorithms. Graph partitioning algorithms are suitable when data can be modelled as a graph, and operations as graph traversal patterns [4].
Other methods become more suitable when the above is not the case, or even when the above holds but the graph topology tends toward being random.
Graph partitioning algorithms can be a powerful sharding tool, but only when applicable. Use the best tool for the job.

If it's easy to identify a sharding policy for a given domain, avoid expensive graph partitioning algorithms.
For example, you run a GIS service for two cities. Users of your system (citizens of each of the cities) frequently want travel plans for within their own city, but rarely to the other. In this case storing all data for a city on the same server/cluster is simple, and may be a good solution.

When to what (identifying the best approach)
One long term vision for this project is to create a repository that makes configuration of the sharding framework as painless as possible.

All social networks share commonalities, as do GIS systems, language learning tools, etc.
Identifying these commonalities is not trivial, but we're working with graphs so graph metrics may be useful when attempting to classify these datasets. Metrics such as clustering coefficient, number of vertices, number of edges, graph diameter, etc all tell us something about the topology of the dataset.

It may be feasible to create a public repository containing mappings between these metrics and matching Insert-Sharding-Functions and Runtime-Sharding-Functions. A dataset analysis tool then produces metrics for a given dataset, and these are used to lookup the best fitting Insert-Sharding-Functions and Runtime-Sharding-Functions from the public repository.

For example, if an Insert-Sharding-Function and Runtime-Sharding-Function work well for Facebook, they can likely be applied Myspace too.

Shiny results
As mentioned, we're only part way into our thesis, and even by the end of it we clearly won't have everything complete.
We have implemented, played around with, and plotted preliminary results for a few algorithms already though.

Below is a visualization (using igraph) we created on a sample dataset (~2,500 vertices, ~7,500 egdes) from an online graph archive. This was run for 500 iterations using the following configuration:
  • Insert-Sharding-Function: Random allocation
  • Runtime-Sharding-Function: DiDiC [1]
  • Migration-Scheduler: Instant migration

Thanks too...
The Neo4j crew, our supervisor Sarunas Girdzijauskas, and Marko Rodriguez, f
or their help so far in providing Martin and I with a continual stream of suggestions, deadlines, and ideas!


  1. Why do you shard? To spread the access load? or because there is not enough storage on a single machine? Do you assume any graph node replication (one node in several shards)? what about consistency?

  2. wojtg,
    Why? Like you say, access load and storage limit are the common reasons

    Do we assume replication? No we don't, please see "What this project isn't" for details

    Consistency? No replication at this stage so no consistency concerns

  3. good luck to you. I commend your efforts

  4. First of all, good luck. NP-Hard problems are, well, hard to takle

    I just wanted to share an idea with you. If the main goal of the partitioning is to reduce the distribution of the traversals, why would you use an algorithm that partitions the graph based on the data and not on the traversals you make on it?

    I mean, kind of like proxies do for the web. If you access certain data a lot you'll end up having local copies of it for better performance.

    As I see it, trying to partition the graph based on the data would be like pre-caching pages on proxies whether they'll be used or not.

    Such an approach might also "solve" the replication problem as a by-product.

    Anyway, this is just food for thought, I don't dare say that this is a good solution... But thought it would be good to leave the comment for meditation.



  5. Hi Paolo,
    Thanks for your comment!

    You're right, at present when we partition we only consider the graph topology and not the access patterns performed on it.

    I agree that intuition suggests considering access patterns would be a fruitful approach. For the purposes of this thesis though it has been placed under "future work" due to lack of time.

    I also agree with your observation that replication based on access patterns may be promising. Again... hopefully, future work.

    On a side note, we DID evaluate the results using real access patterns and they were quite promising...
    I'll make our results available after we submit the report!


  6. how is the work going on? when are you going to share your results? i'm very curious...

  7. the thesis report is submitted and being marked right now. when we get it back I'll upload it to this blog. thanks for your interest!

  8. a final draft of our thesis report is downloadable here: