Sunday, October 17, 2010

Say hi to GraphDB-Bench

Over the past month or two I've been working on a graph database benchmarking tool called GraphDB-Bench. In this post I'll try to explain the whats, whys, and hows of the project. Read on...


Preemptive disclaimer:
The project has just started and is by no means finished. This post is all about introducing the project and hopefully building some momentum behind it. There are still a lot of unanswered questions and unfilled blanks...


What is it?
GraphDB-Bench is a graph database benchmarking tool, which allows the developer to define what type of operations/access-patterns/algorithms a benchmark is comprised of. Additionally, benchmarks are portable: the same benchmark definition (same algorithms, same code, same everything) can be run against graph databases from different vendors.


Neat, but how?
To achieve portability, GraphDB-Bench is built on the TinkerPop stack, making use of Blueprints, Pipes, and Gremlin. These libraries abstract away any knowledge of the underlying graph database implementation, which means benchmarks written in GraphDB-Bench can be executed on any Blueprints-supported graph database: Neo4j, OrientDB, TinkerGraph, InfiniteGraph (soon), Redis (soon), and more.
To see what I mean by "developers can define what type of operations a benchmark is comprised of", please see the wiki documentation.
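To give a flavour of what that looks like, here's a sketch in my own words (the Operation/execute names are made up, not the actual GraphDB-Bench API, and I'm assuming the pre-1.0 Blueprints property graph interfaces - Graph, Vertex, getOutEdges()):

    import com.tinkerpop.blueprints.pgm.Edge;
    import com.tinkerpop.blueprints.pgm.Graph;
    import com.tinkerpop.blueprints.pgm.Vertex;

    // Hypothetical base class - the real GraphDB-Bench API may differ.
    abstract class Operation {
        public abstract long execute(Graph graph, Object startVertexId);
    }

    // Example operation: count the two-hop paths leaving a start vertex.
    class TwoHopCountOperation extends Operation {
        public long execute(Graph graph, Object startVertexId) {
            long count = 0;
            Vertex start = graph.getVertex(startVertexId);
            for (Edge firstHop : start.getOutEdges()) {
                for (Edge secondHop : firstHop.getInVertex().getOutEdges()) {
                    count++; // one two-hop path found
                }
            }
            return count;
        }
    }

Because the operation only ever talks to the Blueprints interfaces, the exact same class runs against Neo4j, OrientDB, or an in-memory TinkerGraph.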


Ok, why does the world need this?
Graph databases are not a hammer, but when applied to the right problem (see The Graph Traversal Pattern and Constructions from Dots and Lines) they outperform the vast majority of other data storage technologies. This has led to the growing popularity of graph databases. However, ask Joe-above-Average if he knows what a graph database is and he'll probably say no - they're taking over the world, but haven't done so yet.
At present, developers don't know how many graph database alternatives exist, let alone how to compare them. Moreover, they probably don't have the time, resources, or desire to thoroughly compare the available solutions. 
Wouldn't it be sweet if we had an unbiased (read: vendor-agnostic), open, extensible, community-driven graph database benchmarking tool... and a public repository of the benchmark results that it produces? Yeah, it would.


What are the main contributions?
Portability, again... to reiterate, in GraphDB-Bench benchmarks are portable: once written, they can be executed on any Blueprints-supported graph database. When a new graph database gets Blueprints support, all existing benchmarks can be run on it, quickly telling us where that database sits in the competitive landscape.


Fairness... benchmarking is a notoriously biased activity. Basically, you only see something if you're actively looking for it. One problem with this is that benchmarks may target a product's strengths and ignore its weaknesses. Another is that different applications perform different types of operations on the database. If a benchmark tests for every type of operation except the ones your application uses, the results will be of limited value.
GraphDB-Bench is very extensible: it allows the developer to define what a benchmark will "look for". This encourages people to define and develop benchmarks that interest them and match their use cases. More importantly, because GraphDB-Bench lets anyone define, extend, and combine benchmarks, it makes it easy for developers to collaborate... hopefully resulting in a community that together decides what aspects of graph databases benchmarks should test.
All of this doesn't make bias go away, but it shifts the bias towards the needs of an open community rather than the needs of an individual or company.


It's not all about "my database is bigger than yours"
Although GraphDB-Bench was first and foremost designed to be a benchmarking tool, it may be useful for a few other purposes. 
If you read the wiki you'll see that GraphDB-Bench makes it easy to execute an arbitrary number of operations (think of an operation as one run of an algorithm) against a database, each with different input parameters (e.g. the traversal starts from a different start vertex each time). Once executed, the running time and output of each operation can be seen in the result logs. This functionality may make it suitable as a tool for comparing the performance/correctness of different algorithm implementations on the same dataset, or of one algorithm on different datasets.
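Roughly speaking, the runner loop boils down to something like this (hypothetical names again, reusing the Operation sketch from earlier in the post):

    import com.tinkerpop.blueprints.pgm.Graph;
    import java.util.List;

    // Sketch: execute one operation many times, each time with a different start
    // vertex, recording the running time and output of each run.
    class OperationRunner {
        void run(Graph graph, Operation operation, List<Object> startVertexIds) {
            for (Object startId : startVertexIds) {
                long begin = System.nanoTime();
                long result = operation.execute(graph, startId);
                long elapsedMs = (System.nanoTime() - begin) / 1000000L;
                // In GraphDB-Bench proper this would end up in the result logs.
                System.out.println(startId + "\t" + result + "\t" + elapsedMs + "ms");
            }
        }
    }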


What about benchmark results?
Sorry, not yet... we're working on it.
This project aims to add value to the graph database ecosystem by providing a vendor-agnostic evaluation tool. A users group discussion has started about how to best carry out the benchmarking process, ensuring minimum bias and maximum transparency. As soon as that gets sorted, the first round of benchmarks will be run, and results will be published on the project wiki.


I like, how can I get some and contribute some back?
Read the wiki, git clone, and join the users group!

Saturday, October 16, 2010

Thesis Report Available

For those interested in reading our thesis report, sorry for the delay in making it publicly available. It actually still hasn't made it through the marking process, so may undergo a few more slight changes. Regardless, I'm grateful to those that have shown interest in our work so I've made the final draft available here: http://arxiv.org/abs/1301.5121.

Regarding follow-up work, it looks like later this year another master's student will base their thesis on ours, extending it further in the direction of social network partitioning. We'll be collaborating with them, and hope to write a joint paper that combines results from both theses.

I'm low on time right now so won't go into any discussion regarding results/findings. If you're interested, please read the report and ask/discuss/critique freely!

Wednesday, April 14, 2010

Thesis - Sharding the Neo4j Graph DB

Me
My name's Alex. I'm currently a distributed systems master's student at KTH, Stockholm, and part way through my thesis at Neo Technology and SICS.
The project is a dual thesis that another student - Martin Neumann - and I are working on together.

What this project is
Graph databases are unique in that they're optimized to model highly connected data, and query it efficiently using graph traversal patterns. The benefits of this approach are numerous [4], but won't be discussed here. Our focus is on understanding the implications that highly connected data has when sharding (partitioning). We intend to investigate the requirements of and start developing an auto sharding framework for the Neo4j graph database. Although it's still early days, this post will (try to) summarize our progress so far and outline ideas for future work.

What this project isn't
There are problems we won't have time to tackle during the project.
These are assumed to be solvable and will be considered in design decisions, but won't be solved.

Replication: although essential to any fault-tolerant distributed system, replication will not be implemented.

Graph partitioning algorithm design: hundreds of graph partitioning algorithms already exist. We intend to do a broad sweep of the field, then identify and evaluate those that appear most relevant, but designing them is not the goal. Graph partitioning algorithms are modern day black magic and their designers are experts in the dark arts. We'll leave the trickery to them and try to create a modular design that supports "pluggable" partitioning algorithms.

Approaching the problem from multiple angles...

Sharding at insert-time
Maybe no links exist between data (e.g. key-value stores), or the graph topology can be guesstimated ahead of time (e.g. GIS applications). In these cases data can be allocated to shards at insert time.
Your favorite DHT can already do this by sharding along the key space, but coupling sharding logic to a particular data structure is limiting.
I haven't looked into Twitter's Gizzard framework in detail but it seems to be powerful in this respect, allowing users to define custom "hash" functions.
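In the simplest case an insert-time sharding function is nothing more than a hash over the record's key. A toy sketch (and nothing Gizzard-specific):

    // Toy insert-time sharding: map a record's key to one of numShards shards.
    final class KeyHashSharding {
        static int shardFor(Object recordKey, int numShards) {
            return Math.abs(recordKey.hashCode() % numShards);
        }
    }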

Decoupled Components
As in Gizzard, it's desirable to abstract the insert time sharding logic into a pluggable component.
We define the Insert-Sharding component for this purpose.

Insert-Sharding
Inputs: Insert-Sharding-Function, Record.
Outputs: Shard-Mapping (mapping between data and shards).
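In code that might boil down to something like the following sketch, where Record and ShardMapping are placeholder types I've made up for illustration:

    // Placeholder types - stand-ins for whatever the framework ends up defining.
    class Record { Object key; Object value; }
    class ShardMapping { Object recordKey; int shardId; }

    // The pluggable Insert-Sharding-Function: decide, at insert time,
    // which shard a new record belongs on.
    interface InsertShardingFunction {
        ShardMapping shardFor(Record record);
    }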

Sharding at run-time
Thanks to the world wide interweb many applications try to model dynamic, constantly evolving domains. Social networks, for example, likely change every time a user starts/ends a relationship, moves to another city, etc.
Insert-Sharding decisions are cheap and effective in many situations, but also have a shelf life.
Not only may the Insert-Sharding scheme need to be updated periodically, but as data and/or access patterns evolve the already sharded data may need to be reallocated to new shards.

Decoupled Components
As with Insert-Sharding, it's desirable to abstract the runtime sharding logic into pluggable components.

Sharding at runtime relies on accurate, current information to make intelligent decisions. User access patterns, shard sizes, links between data, traffic... these are examples of metrics that can be used as input parameters to a runtime sharding algorithm. Applications have differing requirements regarding metrics they log, and metrics they need for runtime sharding. We define a Runtime-Logging component that encapsulates the logging function.

Runtime-Logging
Inputs: Runtime-Logging-Function.
Outputs: Runtime-Metrics.
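Again, only a sketch of the shape such a component might take (hypothetical names, and the metric representation is just a guess):

    import java.util.Map;

    // Sketch of the Runtime-Logging component: capture the metrics an application
    // cares about and expose them to the runtime sharding algorithm.
    interface RuntimeLoggingFunction {
        void logAccess(Object vertexId, int shardId); // e.g. a read served by a shard
        Map<String, Double> runtimeMetrics();         // e.g. traffic per shard, shard sizes
    }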

Implementation of the runtime sharding algorithm is domain specific. To fulfil the goal of being pluggable we loosely define the Runtime-Sharding component.

Runtime-Sharding
Inputs: Runtime-Sharding-Function, Runtime-Metrics, Change-Log (recent CRUD operations).
Outputs: Shard-Mapping.
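A corresponding sketch, reusing the ShardMapping placeholder from the Insert-Sharding sketch above (CrudOperation is another made-up placeholder):

    import java.util.List;
    import java.util.Map;

    // Placeholder for one logged create/read/update/delete operation.
    class CrudOperation { }

    // The pluggable Runtime-Sharding-Function: given current metrics and recent
    // CRUD operations, propose a new mapping of records to shards.
    interface RuntimeShardingFunction {
        List<ShardMapping> reallocate(Map<String, Double> runtimeMetrics,
                                      List<CrudOperation> changeLog);
    }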

Note, graph partitioning algorithms may be a natural fit here, but are not essential. Refer to "Partitioning algorithms in a bit more detail..." for more on graph partitioning algorithms and their application to graph databases.

Reallocation of data to different shards may be beneficial, but if migration occurs during peak load, performance will suffer and the process becomes counterproductive. The Shard-Mappings produced by the Runtime-Sharding component are instructions on where to migrate data; they say nothing about when to migrate.
For that purpose we need a module that is responsible for deciding when data migration should occur, and for issuing commands to the shard servers to perform the migrations.
To perform these tasks we define the Migration-Scheduler component.
Migration-Scheduler
Inputs: Migration-Scheduler-Function, Shard-Mapping(, Runtime-Metrics).
Outputs: Migration-Commands.
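A final sketch for this component (MigrationCommand is, once more, a placeholder type):

    import java.util.List;

    // Placeholder for a "move record X from shard A to shard B, no earlier than T" command.
    class MigrationCommand { Object recordKey; int fromShard; int toShard; long notBeforeMillis; }

    // The Migration-Scheduler: decide when to act on a proposed Shard-Mapping and
    // emit concrete migration commands to the shard servers.
    interface MigrationScheduler {
        List<MigrationCommand> schedule(List<ShardMapping> proposedMapping);
    }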

A loosely coupled framework
This results in the first incarnation of our vision for a loosely coupled, extensible sharding framework.
Note, we've purposely omitted finer details. Partly for readability... and partly because we don't have all the details defined yet.



Partitioning algorithms in a bit more detail...

Graph partitioning algorithms
To massively oversimplify, graph partitioning algorithms partition a graph into a number of disjoint subgraphs, while attempting to minimize edge cut, minimize the size difference between partitions (shards), minimize conductance of individual partitions, and/or maximize modularity of the partitioned graph.
The goal is to partition the graph such that traversal operations execute locally as much as possible, and are required to perform network hops (to different shards) infrequently.
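To make "edge cut" concrete: given a mapping from vertices to shards, the edge cut is simply the number of edges whose two endpoints land on different shards. A minimal sketch, again assuming the pre-1.0 Blueprints interfaces:

    import com.tinkerpop.blueprints.pgm.Edge;
    import com.tinkerpop.blueprints.pgm.Graph;
    import java.util.Map;

    // Count the edges whose endpoints were assigned to different shards.
    final class EdgeCut {
        static long of(Graph graph, Map<Object, Integer> vertexToShard) {
            long cut = 0;
            for (Edge edge : graph.getEdges()) {
                Integer outShard = vertexToShard.get(edge.getOutVertex().getId());
                Integer inShard = vertexToShard.get(edge.getInVertex().getId());
                if (outShard != null && !outShard.equals(inShard)) {
                    cut++;
                }
            }
            return cut;
        }
    }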

Difficulties in partitioning graphs
Optimal graph partitioning is known to be NP-hard. O(n²) complexity is not uncommon, n can be >> 1,000,000,000, and with graphs of this size it becomes unrealistic to fit all state in RAM.
Each algorithm tends to perform well on one set of graph topologies but be ineffective on others.
They often don't cope with dynamism (CRUD), require access to the entire graph, or are inherently sequential.
Moreover, many assume the graph has unlabeled edges, undirected edges, and/or unweighted vertices/edges. Unfortunately, every "un" results in some loss of the underlying graph semantics.

The point isn't that these algorithms are impractical, but that their limitations should be considered and accounted for. There is no one best algorithm, but in most cases a capable algorithm for a given application exists.

Importance of access patterns
When access patterns exist (and are recognizable) taking them into consideration is likely to be beneficial.
To oversimplify again, take the - completely hypothetical - example of a social network where the most common operation is, "find my ex-lovers' current relationship status".
Social networks are graphs and this is a very simple graph traversal operation, making this potentially suitable for a graph partitioning algorithm. An algorithm that, during partitioning, places greater importance on not cutting the is_an_ex_lover edge type may do well in this case. Conversely, one that places greater importance on the is_an_old_school_colleague_that_I_never_speak_with edge type may miss the point.
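One way to express that preference is to map edge types to weights before handing the graph to a weighted partitioning algorithm. Purely illustrative:

    import java.util.HashMap;
    import java.util.Map;

    // Purely illustrative: bias a weighted partitioner by edge type.
    final class EdgeTypeWeights {
        static Map<String, Double> forExLoverExample() {
            Map<String, Double> weights = new HashMap<String, Double>();
            weights.put("is_an_ex_lover", 10.0); // expensive to cut
            weights.put("is_an_old_school_colleague_that_I_never_speak_with", 0.1); // cheap to cut
            return weights;
        }
    }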

Algorithms we've looked at
As mentioned, we've tried to identify and evaluate a number of graph partitioning algorithms that appear most relevant.

To gauge relevance we first defined "desirable" characteristics of a graph partitioning algorithm. They are:
  • Dynamic: Algorithm can update an existing partitioning when CRUD operations are performed, without completely restarting.
  • Iterative: Rather than computing the partitioning in one operation, it continually improves over time.
  • Smoothness: When updating a partitioning, guarantees are given that the partitioning will not drastically change.
  • Distributed: Algorithm can be distributed across multiple machines, where each machine only stores and processes a subgraph of the global graph. Especially useful for massive graphs and for parallelizing the Runtime-Sharding process.
  • Local View: At no time does the algorithm need to have knowledge of the entire graph.
  • Weighted: Algorithm makes use of edge and/or vertex weights. This allows us to map certain metrics or graph properties to weights, to capture more of the graph's underlying semantics.
Using these characteristics we tracked down and began evaluation on the following algorithms:

DiDiC [1]:
"A Distributed Diffusive Heuristic for Clustering a Virtual P2P Supercomputer"
Dynamic, iterative, distributed, local view, smooth, weighted edges.
Reliant on the existence of clusters. May be slow to converge to a good partitioning, but its iterative nature partly makes up for this.
Similar in nature to p2p gossip algorithms.

Evolving Set Process [2]:
"Finding sparse cuts locally using evolving sets"
Fast, but extremely reliant on the existence of clusters.
Local view, weighted edges.

Min-cut Tree [3]:
"Dynamic Graph Clustering Using Minimum-Cut Trees "
Dynamic, local view (mostly), smooth (for certain operations).

Non-graph partitioning algorithms
Not all partitioning algorithms are graph partitioning algorithms. Graph partitioning algorithms are suitable when data can be modelled as a graph, and operations as graph traversal patterns [4].
Other methods become more suitable when the above is not the case, or even when the above holds but the graph topology tends toward being random.
Graph partitioning algorithms can be a powerful sharding tool, but only when applicable. Use the best tool for the job.

If it's easy to identify a sharding policy for a given domain, avoid expensive graph partitioning algorithms.
For example, you run a GIS service for two cities. Users of your system (citizens of each of the cities) frequently want travel plans within their own city, but rarely to the other. In this case storing all data for a city on the same server/cluster is simple, and may be a good solution.

When to what (identifying the best approach)
One long term vision for this project is to create a repository that makes configuration of the sharding framework as painless as possible.

All social networks share commonalities, as do GIS systems, language learning tools, etc.
Identifying these commonalities is not trivial, but we're working with graphs so graph metrics may be useful when attempting to classify these datasets. Metrics such as clustering coefficient, number of vertices, number of edges, graph diameter, etc all tell us something about the topology of the dataset.

It may be feasible to create a public repository containing mappings between these metrics and matching Insert-Sharding-Functions and Runtime-Sharding-Functions. A dataset analysis tool then produces metrics for a given dataset, and these are used to lookup the best fitting Insert-Sharding-Functions and Runtime-Sharding-Functions from the public repository.
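The lookup itself could be as simple as a nearest-match over a handful of metrics. A sketch, where Profile and the distance measure are placeholders and very much open for discussion:

    import java.util.List;

    // Sketch: pick the catalogued dataset profile closest to ours and reuse its
    // sharding functions. Profile and the distance measure are placeholders.
    final class ShardingFunctionRepository {
        static class Profile {
            double clusteringCoefficient;
            long vertexCount;
            long edgeCount;
            String insertShardingFunction;  // e.g. name of a pluggable function
            String runtimeShardingFunction;
        }

        static Profile bestMatch(Profile dataset, List<Profile> repository) {
            Profile best = null;
            double bestDistance = Double.MAX_VALUE;
            for (Profile candidate : repository) {
                double distance =
                    Math.abs(candidate.clusteringCoefficient - dataset.clusteringCoefficient)
                  + Math.abs(Math.log10(candidate.vertexCount + 1.0) - Math.log10(dataset.vertexCount + 1.0))
                  + Math.abs(Math.log10(candidate.edgeCount + 1.0) - Math.log10(dataset.edgeCount + 1.0));
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = candidate;
                }
            }
            return best;
        }
    }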

For example, if an Insert-Sharding-Function and Runtime-Sharding-Function work well for Facebook, they can likely be applied to Myspace too.

Shiny results
As mentioned, we're only part way into our thesis, and even by the end of it we clearly won't have everything complete.
We have implemented, played around with, and plotted preliminary results for a few algorithms already though.

Below is a visualization (using igraph) we created on a sample dataset (~2,500 vertices, ~7,500 edges) from an online graph archive. This was run for 500 iterations using the following configuration:
  • Insert-Sharding-Function: Random allocation
  • Runtime-Sharding-Function: DiDiC [1]
  • Migration-Scheduler: Instant migration

Thanks to...
The Neo4j crew, our supervisor Sarunas Girdzijauskas, and Marko Rodriguez, for their help so far in providing Martin and me with a continual stream of suggestions, deadlines, and ideas!