<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4782139802712335622</id><updated>2011-11-24T00:13:14.648-08:00</updated><category term='graph partitioning'/><category term='thesis'/><category term='gremlin'/><category term='neo4j'/><category term='blueprints'/><category term='nosql'/><category term='graph clustering'/><category term='benchmark'/><category term='lasers'/><category term='graphdb-bench'/><category term='graph'/><category term='graph database'/><category term='tinkerpop'/><category term='orientdb'/><title type='text'>random gossip</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://alexaverbuch.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://alexaverbuch.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Alex Averbuch</name><uri>http://www.blogger.com/profile/09276752613359599389</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_UNsxewwBXJc/S8SRpenbkaI/AAAAAAAAAAM/aaYoskdC_nQ/s1600-R/d68871d265748f5bb7efdfacb80f6308.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>3</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4782139802712335622.post-5154746709671886031</id><published>2010-10-17T02:23:00.000-07:00</published><updated>2010-10-17T02:51:57.558-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='orientdb'/><category scheme='http://www.blogger.com/atom/ns#' term='tinkerpop'/><category scheme='http://www.blogger.com/atom/ns#' term='neo4j'/><category scheme='http://www.blogger.com/atom/ns#' term='graphdb-bench'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='blueprints'/><category scheme='http://www.blogger.com/atom/ns#' term='graph database'/><category scheme='http://www.blogger.com/atom/ns#' term='graph'/><category scheme='http://www.blogger.com/atom/ns#' term='gremlin'/><category scheme='http://www.blogger.com/atom/ns#' term='benchmark'/><title type='text'>Say hi to GraphDB-Bench</title><content type='html'>&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Over the past month or two I've been working on a graph database benchmarking tool, called GraphDB-Bench. In this post I'll try to explain the whats, whys, and hows of the project. Read on...&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;Preemptive disclaimer:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The project has just started and is by no means finished. This post is all about introducing the project and hopefully building some momentum behind it. There are still a lot of unanswered questions and unfilled blanks...&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;What is it?&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;GraphDB-Bench is a graph database benchmarking tool, which allows the developer to define what type of operations/access-patterns/algorithms a benchmark is comprised of. Additionally, benchmarks are portable: the same benchmark definition (same algorithms, same code, same everything) can be run against graph databases from different vendors.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Neat, but how?&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;To achieve portability, GraphDB-Bench is built on the &lt;a href="http://www.tinkerpop.com/"&gt;TinkerPop&lt;/a&gt; stack, making use of &lt;a href="http://www.tinkerpop.com/post/380560703/blueprints-data-models-and-their-implementations"&gt;Blueprints&lt;/a&gt;, &lt;a href="http://www.tinkerpop.com/post/554238754/pipes-a-data-flow-framework-using-process-graphs"&gt;Pipes&lt;/a&gt;, and &lt;a href="http://www.tinkerpop.com/post/370705133/gremlin-a-graph-based-programming-language"&gt;Gremlin&lt;/a&gt;. These libraries abstract away any knowledge of the underlying graph database implementation, which means benchmarks written in GraphDB-Bench can be executed on any Blueprints-supported graph database: &lt;a href="http://neo4j.org/"&gt;Neo4j&lt;/a&gt;, &lt;a href="http://www.orientechnologies.com/"&gt;OrientDB&lt;/a&gt;, &lt;a href="http://github.com/tinkerpop/blueprints/wiki/tinkergraph"&gt;TinkerGraph&lt;/a&gt;, &lt;a href="http://www.infinitegraph.com/"&gt;InfiniteGraph&lt;/a&gt; (soon), &lt;a href="http://code.google.com/p/redis/"&gt;Redis&lt;/a&gt; (&lt;a href="http://github.com/dmitriid/blueredis"&gt;soon&lt;/a&gt;), &amp;nbsp;and more.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;To see what I mean by &amp;nbsp;"&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;developers can define what type of operations a benchmark is comprised of&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;" please&amp;nbsp;&lt;a href="http://github.com/tinkerpop/graphdb-bench/wiki"&gt;see the wiki documentation&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;b&gt;Ok, why does the world need this?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Graph databases are not a hammer, but when applied to the right problem (see&amp;nbsp;&lt;a href="http://arxiv.org/abs/1004.1001"&gt;The Graph Traversal Pattern&lt;/a&gt; and&amp;nbsp;&lt;a href="http://arxiv.org/abs/1006.2361"&gt;Constructions from Dots and Lines&lt;/a&gt;) they outperform the vast majority of other data storage technologies. This has lead to the growing popularity of graph databases. However, ask Joe-above-Average if he knows what a graph database is and he'll probably say no - they're taking over the world, but haven't done so yet.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;At present, developers don't know how many graph database alternatives exist, let alone how to compare them. Moreover, they probably don't have the time, resources, or desire to thoroughly compare the available solutions.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Wouldn't it be sweet if we had an unbiased (read: vendor agnostic), open, extensible, community driven graph database benchmarking tool... and a public repository of the benchmark results that it produces? Yeah, it would.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;b&gt;What are the main contributions?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Portability, again...&amp;nbsp;to reiterate, in GraphDB-Bench benchmarks are portable: once written, they can be executed on any Blueprints-supported graph database. When a new graph database gets Blueprints support all existing benchmarks can be run on it, quickly telling us where that database sits in the competitive landscape.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Fairness...&amp;nbsp;benchmarking is a notoriously biased activity. Basically, you only see something if you're actively looking for it. One problem with this is benchmarks may target a products strengths and ignore it's weaknesses. Another is that different applications perform different types of operations on the database. If a benchmark tests for every type of operation except the ones your application uses, the results will be of limited value.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;GraphDB-Bench is very extensible, it allows the developer to define what a benchmark will "look for". This encourages people to define and develop benchmarks that interest them and match their use cases. More importantly, because GraphDB-Bench lets anyone define, extend, and combine benchmarks, it makes it easy for developers to collaborate... hopefully resulting in a community that together decides what aspects of graph databases benchmarks should test.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;All of this doesn't make bias go away, but shifts the bias towards the needs of an open community, rather than the needs of an individual/company.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;b&gt;It's not all about "my database is bigger than yours"&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Although GraphDB-Bench was first and foremost designed to be a benchmarking tool, it may be useful for a few other purposes.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;If you read the &lt;a href="http://github.com/tinkerpop/graphdb-bench/wiki"&gt;wiki&lt;/a&gt; you'll see that GraphDB-Bench makes it easy to execute an arbitrary number of operations (think of an operation as one run of an algorithm) against a database, each with different input parameters (e.g. the traversal starts from a different start vertex each time). Once executed, the running time and output of each operation can be seen in the result logs.&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;This functionality may&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;make it suitable as a tool for comparing the performance/correctness of different algorithm implementations on the same dataset, or of one algorithm on different datasets.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;b&gt;What about benchmark results?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Sorry, not yet... we're working on it.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;This project aims to add value to the graph database ecosystem by providing a vendor agnostic evaluation tool. A &lt;a href="http://groups.google.com/group/gremlin-users/browse_thread/thread/1678377d8745f424"&gt;users group discussion&lt;/a&gt; has started about how to best carry out the benchmarking process, ensuring minimum bias and maximum transparency. As soon as that gets sorted the first round of benchmarks will be run, and results will be published on the project wiki.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;b&gt;I like, how can I get some and contribute some back?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Read the &lt;a href="http://github.com/tinkerpop/graphdb-bench/wiki"&gt;wiki&lt;/a&gt;, &lt;a href="http://github.com/tinkerpop/graphdb-bench"&gt;git clone&lt;/a&gt;, and join the &lt;a href="http://groups.google.com/group/gremlin-users"&gt;users group&lt;/a&gt;!&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4782139802712335622-5154746709671886031?l=alexaverbuch.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexaverbuch.blogspot.com/feeds/5154746709671886031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alexaverbuch.blogspot.com/2010/10/say-hi-to-graphdb-bench.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default/5154746709671886031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default/5154746709671886031'/><link rel='alternate' type='text/html' href='http://alexaverbuch.blogspot.com/2010/10/say-hi-to-graphdb-bench.html' title='Say hi to GraphDB-Bench'/><author><name>Alex Averbuch</name><uri>http://www.blogger.com/profile/09276752613359599389</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_UNsxewwBXJc/S8SRpenbkaI/AAAAAAAAAAM/aaYoskdC_nQ/s1600-R/d68871d265748f5bb7efdfacb80f6308.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4782139802712335622.post-3282542311042894062</id><published>2010-10-16T04:51:00.000-07:00</published><updated>2010-10-16T04:54:51.832-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='neo4j'/><category scheme='http://www.blogger.com/atom/ns#' term='graph clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='graph partitioning'/><category scheme='http://www.blogger.com/atom/ns#' term='thesis'/><category scheme='http://www.blogger.com/atom/ns#' term='graph database'/><category scheme='http://www.blogger.com/atom/ns#' term='graph'/><title type='text'>Thesis Report Available</title><content type='html'>For those interested in reading our thesis report, sorry for the delay in making it publicly available. It actually still hasn't made it through the marking process, so may undergo a few more slight changes. Regardless, I'm grateful to those that have shown interest in our work so I've made the &lt;a href="http://dl.dropbox.com/u/1552664/Master%20Thesis%20-%20Partitioning%20Graph%20Databases%20-%202010%20-%20Alex%20Averbuch%20%26%20Martin%20Neumann.pdf"&gt;final draft available here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Regarding follow-up work, it looks like later this year another masters student will base their thesis on ours, extending it further in the direction of social network partitioning. We'll be collaborating with them, and hope to write a joint paper, which combines results from both theses.&lt;br /&gt;&lt;br /&gt;I'm low on time right now so won't go into any discussion regarding results/findings. If you're interested, please read the report and ask/discuss/critique freely!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4782139802712335622-3282542311042894062?l=alexaverbuch.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexaverbuch.blogspot.com/feeds/3282542311042894062/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alexaverbuch.blogspot.com/2010/10/thesis-report-available.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default/3282542311042894062'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default/3282542311042894062'/><link rel='alternate' type='text/html' href='http://alexaverbuch.blogspot.com/2010/10/thesis-report-available.html' title='Thesis Report Available'/><author><name>Alex Averbuch</name><uri>http://www.blogger.com/profile/09276752613359599389</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_UNsxewwBXJc/S8SRpenbkaI/AAAAAAAAAAM/aaYoskdC_nQ/s1600-R/d68871d265748f5bb7efdfacb80f6308.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4782139802712335622.post-2139376882520026620</id><published>2010-04-14T03:42:00.002-07:00</published><updated>2010-04-23T02:57:43.254-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='lasers'/><category scheme='http://www.blogger.com/atom/ns#' term='neo4j'/><category scheme='http://www.blogger.com/atom/ns#' term='graph clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='graph partitioning'/><category scheme='http://www.blogger.com/atom/ns#' term='graph database'/><category scheme='http://www.blogger.com/atom/ns#' term='graph'/><title type='text'>Thesis - Sharding the Neo4j Graph DB</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_UNsxewwBXJc/S8YYQlo8z_I/AAAAAAAAABA/JT-46hEi-CI/s1600/architecture.png"&gt;&lt;/a&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Me&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;My name's Alex, I'm currently a distributed systems masters student at KTH, Stockholm, and part way through my thesis at &lt;/span&gt;&lt;a href="http://neotechnology.com/"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Neo Technology&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;a href="http://sics.se/"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;SICS&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The project is a dual thesis that another student - Martin Neumann - and I are working on together. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;What this project is&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Graph databases are unique in that they're optimized to model highly connected data, and query it efficiently using graph traversal patterns. &lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;The benefits of this approach are numerous &lt;a href="http://arxiv.org/abs/1004.1001"&gt;[4]&lt;/a&gt;, but won't be discussed here. &lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Our focus is on understanding the implications that highly connected data has when sharding (partitioning). &lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;We intend to investigate the requirements of and start developing an auto sharding framework for the &lt;/span&gt;&lt;a href="http://neo4j.org/"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Neo4j graph database&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. &lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;Although it's still early days, this post will (try to) summarize our progress so far and outline ideas for future work.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;What this project isn't&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;There are problems we won't have time to tackle during the project.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;These are assumed to be solvable and will be considered in design decisions, but won't be solved.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Replication&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: although essential to any fault tolerant distributed system, will not be implemented.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Graph partitioning algorithm design&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: 100's of graph partitioning algorithms exist already. We intend to do a broad sweep of the field then identify and evaluate those that appear most relevant, but designing them is not the goal. Graph partitioning algorithms are modern day black magic and their designers are experts in the dark arts. We'll leave the trickery to them and try to create a modular design that supports "pluggable" partitioning algorithms.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Approaching the problem from multiple angles...&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Sharding at insert-time&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Maybe no links exist between data (E.g. key-value stores), or the graph topology can be guesstimated ahead of time (E.g. GIS applications). In these cases data can be allocated to shards at insert time.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Your favorite DHT can already do this by sharding along the key space, but coupling sharding logic to a particular data structure is limiting.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;I haven't looked into &lt;/span&gt;&lt;a href="http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Twitter's Gizzard framework&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; in detail but it seems to be powerful in this respect, allowing users to define custom "hash" functions. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Decoupled Components &lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;b&gt;&lt;/b&gt;As in Gizzard, it's desirable to abstract the insert time sharding logic into a pluggable component. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;We define the &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; component for this purpose. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Inputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding-Function&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Record&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Outputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Shard-Mapping&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; (mapping between data and shards).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Sharding at run-time&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Thanks to the world wide interweb many applications try to model dynamic, constantly evolving domains. &lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;Social networks, for example, likely change every time a user starts/ends a relationship, moves to another city, etc.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; decisions are cheap and effective in many situations, but also have a shelf life.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Not only may the &lt;b&gt;&lt;i&gt;Insert-Sharding&lt;/i&gt;&lt;/b&gt; scheme need to be updated periodically, but as data and/or access patterns evolve the already sharded data may need to be reallocated to new shards.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Decoupled Components&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;As with Insert-Sharding, it's desirable to abstract the runtime sharding logic into pluggable components.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Sharding at runtime relies on accurate, current information to make intelligent decisions. User access patterns, shard sizes, links between data, traffic... these are examples of metrics that can be used as input parameters to a runtime sharding algorithm. Applications have differing requirements regarding metrics they log, and metrics they need for runtime sharding. We define a &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Logging&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; component that encapsulates the logging function.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Logging&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Inputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Logging-Function&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Outputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Metrics&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Implementation of the runtime sharding algorithm is domain specific. To fulfil the goal of being pluggable we loosely define the &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; component.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Inputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding-Function&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Metrics&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Change-Log&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; (recent CRUD operations).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Outputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Shard-Mapping&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style=" white-space: pre; "&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=" white-space: pre; "&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;span class="Apple-style-span"  style="font-style: normal;  font-size:16px;"&gt;&lt;div&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;Note, graph partitioning algorithms may be a natural fit here, but are not essential. Refer to "&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-style: normal; "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;Partitioning algorithms in a bit more detail...&lt;span class="Apple-style-span" style="font-weight: normal; "&gt;&lt;i&gt;" for more on graph partitioning algorithms and their application to graph databases.&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-style: normal; "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal; "&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Reallocation of data to different shards may be beneficial, but if migration occurs during peak load performance will suffer and the process becomes counter productive. &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Shard-Mapping &lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;produced by the &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; component are instructions on &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;where&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; to migrate data, they say nothing about &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;when&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; to migrate.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;For that purpose we need a module that is responsible for deciding when data migration should occur, and then issues commands to the shard servers to perform the migrations.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;To perform these tasks we define the &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Migration-Scheduler&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; component.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span" style="white-space: normal; "&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Migration-Scheduler&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Inputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Migration-Scheduler-Function&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Shard-Mapping&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;(, &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Metrics&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Outputs: &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Migration-Commands&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="color:#006600;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;A loosely coupled framework&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;This results in the first incarnation of our vision for a loosely coupled, extensible sharding framework.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Note, we've purposely emitted finer details. Partly for readability... and partly because we don't have all the details defined yet.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#FF0000;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#FF0000;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); font-weight: normal; "&gt;&lt;a href="http://3.bp.blogspot.com/_UNsxewwBXJc/S8YYQlo8z_I/AAAAAAAAABA/JT-46hEi-CI/s1600/architecture.png"&gt;&lt;img src="http://3.bp.blogspot.com/_UNsxewwBXJc/S8YYQlo8z_I/AAAAAAAAABA/JT-46hEi-CI/s400/architecture.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5460078271494148082" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 247px; height: 400px; " /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Partitioning algorithms in a bit more detail...&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Graph partitioning algorithms&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;To massively over simplify, &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Graph_partitioning"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;graph partitioning algorithms&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; partition a graph into a number of disjoint subgraphs, while attempting to minimize &lt;/span&gt;&lt;a href="http://en.wiktionary.org/wiki/Appendix:Glossary_of_graph_theory"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;edge cut&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, minimize the size difference between partitions (shards), minimize &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Conductance_%28graph%29"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;conductance&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; of individual partitions, and/or maximize &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Modularity_%28networks%29"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;modularity&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; of the partitioned graph.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The goal is to partition the graph such that traversal operations execute locally as much as possible, and are required to perform network hops (to different shards) infrequently.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="white-space: pre;  font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Difficulties in partitioning graphs&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Optimal graph partitioning is known to be NP-hard. &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;O&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;(n²) complexity is not uncommon, n can be &gt;&gt; 1,000,000,000, and with graphs of this size it becomes unrealistic to fit all state in RAM.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Each algorithm often performs well on a set of graph topologies, then is ineffective on others. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;They often don't cope with dynamism (CRUD), require access to the entire graph, or are inherently sequential. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Moreover, many assume the graph has unlabeled edges, undirected edges, and/or unweighted vertices/edges. Unfortunately, every "un" results in some loss of the underlying graph semantics.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The point isn't that these algorithms are impractical, but limitations should be considered and accounted for. There is no one best algorithm, but in most cases a capable algorithm for a given application exists.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Importance of access patterns&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;When access patterns exist (and are recognizable) taking them into consideration is likely to be beneficial.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;To over simplify again, take the - completely hypothetical - example of a social network where the most common operation is, "find my ex-lovers' current relationship status".&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Social networks are graphs and this is a very simple graph traversal operation, making this potentially suitable for a graph partitioning algorithm. An algorithm that, during partitioning, places greater importance on not cutting the &lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;is_an_ex_lover&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; edge type may do well in this case. Conversely, one that places greater importance on the &lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;is_an_old_school_colleague_that_I_never_speak_with&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; may miss the point.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Algorithms we've looked at&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;As mentioned, we've tried to identify and evaluate a number of graph partitioning algorithms that appear most relevant.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;To gauge relevance we first defined "desirable" characteristics of a graph partitioning algorithm, they are:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Dynamic&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: Algorithm can update an existing partitioning when CRUD operations are performed, without completely restarting.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Iterative&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;:&lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Rather than computing the partitioning in one operation, it continually improves over time.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Smoothness&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: When updating a partitioning, guarantees are given that the partitioning will not drastically change.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Distributed&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;:&lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Algorithm can be distributed across multiple machines, where each machine only stores and processes a subgraph of the global graph. Especially useful for massive graphs and for parallelizing the &lt;b&gt;&lt;i&gt;Runtime-Sharding&lt;/i&gt;&lt;/b&gt; process.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Local View&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;:&lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;At no time does the algorithm need to have knowledge of the entire graph.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Weighted&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;:&lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Algorithm makes use of edge and/or vertex weights. This allows us to map certain metrics or graph properties to weights, to capture more of the graph's underlying semantics.&lt;/span&gt;&lt;span class="Apple-style-span"  style=" white-space: pre; font-size:small;"&gt;  &lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Using these characteristics we tracked down and began evaluation on the following algorithms:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;DiDiC &lt;a href="http://aeolus.ceid.upatras.gr/scientific-reports/4th-year-reports/distribClust.pdf"&gt;[1]&lt;/a&gt;:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;"A Distributed Diffusive Heuristic for Clustering a Virtual P2P Supercomputer"&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Dynamic, iterative, distributed, local view, smooth, weighted edges.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Reliant on the existence of clusters. May be slow to converge to a good partitioning, but it's iterative nature partly makes up for this.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Similar in nature to p2p gossip algorithms.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;Evolving Set Process &lt;a href="http://portal.acm.org/citation.cfm?id=1536414.1536449"&gt;[2]&lt;/a&gt;:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;"Finding sparse cuts locally using evolving sets"&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Fast, but extremely reliant on the existence of clusters.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Local view, weighted edges.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;Min-cut Tree &lt;a href="http://www.springerlink.com/content/71l74772v0723j04/"&gt;[3]&lt;/a&gt;:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;"Dynamic Graph Clustering Using Minimum-Cut Trees "&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Dynamic, local view (mostly), smooth (for certain operations).&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Non-graph partitioning algorithms&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Not all partitioning algorithms are graph partitioning algorithms. Graph partitioning algorithms are suitable when data can be modelled as a graph, and operations as graph traversal patterns &lt;a href="http://arxiv.org/abs/1004.1001"&gt;[4]&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Other methods become more suitable when the above is not the case, or even when the above holds but the graph topology tends toward being random.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Graph partitioning algorithms can be a powerful sharding tool, but only when applicable. Use the best tool for the job.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;If it's easy to identify a sharding policy for a given domain, avoid expensive graph partitioning algorithms.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;For example, you run a GIS service for two cities. Users of your system (citizens of each of the cities) frequently want travel plans for within their own city, but rarely to the other. In this case storing all data for a city on the same server/cluster is simple, and may be a good solution.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;When to what (identifying the best approach)&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;One long term vision for this project is to create a repository that makes configuration of the sharding framework as painless as possible.&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;All social networks share commonalities, as do GIS systems, language learning tools, etc.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Identifying these commonalities is not trivial, but we're working with graphs so graph metrics may be useful when attempting to classify &lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;these &lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;datasets. Metrics such as &lt;a href="http://en.wikipedia.org/wiki/Clustering_coefficient"&gt;clustering coefficient&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, number of vertices, number of edges, &lt;a href="http://en.wikipedia.org/wiki/Graph_diameter"&gt;graph diameter&lt;/a&gt;, etc all tell us something about the topology of the dataset.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;It may be feasible to create a public repository containing mappings between these metrics and matching &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding-Functions&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding-Functions&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. A dataset analysis tool then produces metrics for a given dataset, and these are used to lookup the best fitting &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding-Functions&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding-Functions&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; from the public repository.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;For example, if an &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding-Function&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding-Function&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; work well for Facebook, they can likely be applied Myspace too.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Shiny results&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;As mentioned, we're only part way into our thesis, and even by the end of it we clearly won't have everything complete.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;We have implemented, played around with, and plotted preliminary results for a few algorithms already though.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Below is a visualization (using &lt;/span&gt;&lt;a href="http://igraph.sourceforge.net/"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;igraph&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;) we created on a sample dataset (~2,500 vertices, ~7,500 egdes) from an &lt;/span&gt;&lt;a href="http://staffweb.cms.gre.ac.uk/~c.walshaw/partition/#FarhatCaSt88"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;online graph archive&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. This was run for 500 iterations using the following configuration:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Insert-Sharding-Function&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: Random allocation&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Runtime-Sharding-Function&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: DiDiC &lt;a href="http://aeolus.ceid.upatras.gr/scientific-reports/4th-year-reports/distribClust.pdf"&gt;[1]&lt;/a&gt; &lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Migration-Scheduler&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;: Instant migration&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#FF0000;"&gt;&lt;b&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/jzkOJyOblNA&amp;amp;hl=en_US&amp;amp;fs=1&amp;amp;"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;embed src="http://www.youtube.com/v/jzkOJyOblNA&amp;amp;hl=en_US&amp;amp;fs=1&amp;amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#FF0000;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 0); font-weight: normal;  font-size:16px;"&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style=" ;font-size:large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;Thanks too...&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The Neo4j crew, our supervisor Sarunas Girdzijauskas, and Marko Rodriguez, f&lt;/span&gt;&lt;span class="Apple-style-span" style="color: rgb(255, 0, 0); "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 0); font-weight: normal;  font-size:16px;"&gt;&lt;div style="display: inline !important; "&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span"  style=" color: rgb(255, 0, 0); font-size:16px;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 0); font-weight: normal;  font-size:16px;"&gt;&lt;div style="display: inline !important; "&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span"  style=" color: rgb(255, 0, 0); font-size:16px;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 0); font-weight: normal;  font-size:16px;"&gt;&lt;div style="display: inline !important; "&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span"  style=" color: rgb(255, 0, 0); font-size:16px;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 0); font-weight: normal;  font-size:16px;"&gt;&lt;div style="display: inline !important; "&gt;&lt;span class="Apple-style-span"  style=" ;font-size:small;"&gt;or their help so far in providing Martin and I with a continual stream of suggestions, deadlines, and ideas!&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://aeolus.ceid.upatras.gr/scientific-reports/4th-year-reports/distribClust.pdf"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;[1] J. Gehweiler and H. Meyerhenke. A Distributed Diﬀusive Heuristic for Clustering a Virtual P2P Supercomputer. In Submitted to 7th High-Performance Grid Computing Workshop (HGCW’10) in conjunction with Intl. Parallel and Distributed Processing Symposium (IPDPS’10), 2010.&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1536414.1536449"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;[2] R. Andersen and Y. Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the 41st annual ACM symposium on Theory of computing, pages 235–244. ACM, 2009.&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.springerlink.com/content/71l74772v0723j04/"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;[3] R. Gorke, T. Hartmann, and D. Wagner. Dynamic graph clustering using minimum-cut trees. Algorithms and Data Structures, pages 339–350.&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;a href="http://arxiv.org/abs/1004.1001"&gt;[4] Marko A. Rodriguez and Peter Neubauer. The graph traversal pattern. [book chapter in review] arXiv:1004.1001, AT&amp;amp;Ti and NeoTechnology, 2010.&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4782139802712335622-2139376882520026620?l=alexaverbuch.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexaverbuch.blogspot.com/feeds/2139376882520026620/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alexaverbuch.blogspot.com/2010/04/me-my-names-alex-im-currently.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default/2139376882520026620'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4782139802712335622/posts/default/2139376882520026620'/><link rel='alternate' type='text/html' href='http://alexaverbuch.blogspot.com/2010/04/me-my-names-alex-im-currently.html' title='Thesis - Sharding the Neo4j Graph DB'/><author><name>Alex Averbuch</name><uri>http://www.blogger.com/profile/09276752613359599389</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_UNsxewwBXJc/S8SRpenbkaI/AAAAAAAAAAM/aaYoskdC_nQ/s1600-R/d68871d265748f5bb7efdfacb80f6308.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_UNsxewwBXJc/S8YYQlo8z_I/AAAAAAAAABA/JT-46hEi-CI/s72-c/architecture.png' height='72' width='72'/><thr:total>9</thr:total></entry></feed>
