← The Social Graph of the Los Alamos National Laboratory

Powers of Ten – Part I

May 29, 2014 by Stephen Mallette 1 Comment

“No, no! The adventures first,’ said the Gryphon in an impatient tone: ‘explanations take such a dreadful time.”
— Lewis Carroll – Alice’s Adventures in Wonderland

It is often quite simple to envision the benefits of using Titan. Developing complex graph analytics over a multi-billion edge distributed graph represent the adventures that await. Like the Gryphon from Lewis Carroll’s tale, the desire to immediately dive into the adventures can be quite strong. Unfortunately and quite obviously, the benefits of Titan cannot be realized until there is some data present within it. Consider the explanations that follow; they are the strategies by which data is bulk loaded to Titan enabling the adventures to ensue.

There are a number of different variables that might influence the approach to loading data into a graph, but the attribute that provides the best guidance in making a decision is size. For purposes of this article, “size” refers to the estimated number of edges to be loaded into the graph. The strategy used for loading data tends to change in powers of ten, where the strategy for loading 1 million edges is different than the approach for 10 million edges.

Given this neat and memorable way to categorize batch loading strategies, this two-part article outlines each strategy starting with the smallest at 1 million edges or less and continuing in powers of ten up to 1 billion and more. This first part will focus on 1 million and 10 million edges, which generally involves common Gremlin operations. The second part will focus on 100 million and 1 billion edges, which generally involves the use of Faunus.

1 Million

Gremlin Ten to the Sixth In the range of millions of edges or less, there really isn’t a particular loading strategy to follow, as the graph will likely fit in memory and the load time will be reasonably fast. Mistakes at this scale are not as costly because problems are typically straightforward to diagnose and the graph can be reloaded from scratch without much cost in terms of time.

As explained in a previous blog post entitled Polyglot Persistence and Query with Gremlin, the Gremlin REPL is a flexible environment for working with any kind of data. Obviously, it provides access to graph databases like Titan, but within the same REPL session, it is possible to also connect to a relational database, reach out to a web service, read a file, etc. Given this capability, writing a Gremlin script that can be executed through the REPL is likely the most lightweight and direct manner to get data into a graph.

Wikipedia Vote Network Schema The Wikipedia Vote Network is a data set that “contains all the Wikipedia voting data from the inception of Wikipedia until January 2008. Vertices in the network represent Wikipedia users and a directed edge from vertex i to vertex j represents that user i voted on user j.” Within its basic tab-delimited data structure, it contains 7,115 vertices and 103,689 edges, making it a good size for this demonstration.

All of the examples in this post, assume that the latest version of Titan is downloaded and unpackaged (titan-all packaging is required for the examples). From Titan’s root directory download and unpackage the Wikipedia Vote Network data set:

$ curl -L -O http://snap.stanford.edu/data/wiki-Vote.txt.gz
$ gunzip wiki-Vote.txt.gz

Unzipping the archive will create the wiki-Vote.txt file in the root of Titan directory. The following Gremlin script demonstrates how to load that file into Titan (backed by BerkleyDB):

g = TitanFactory.open('/tmp/1m')
g.makeKey('userId').dataType(String.class).indexed(Vertex.class).unique().make()
g.makeLabel('votesFor').make()
g.commit()

getOrCreate = { id ->
  def p = g.V('userId', id)
  if (p.hasNext()) ? p.next() : g.addVertex([userId:id])
}
 
new File('wiki-Vote.txt').eachLine {
  if (!it.startsWith("#")){
    (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
    fromVertex.addEdge('votesFor', toVertex)
  }
}
 
g.commit()

The key portions of the data loading script to pay attention to are as follows:

g.makeKey(‘userId’)… – Create types in Titan first. In this case, the schema only consists of a userId that will be on each user vertex. Always commit at the end of type creation and before loading data to the graph instance.
getOrCreate = { id ->... – A helper function that takes a vertex identifier (i.e. userId) as an argument and does an index look-up to determine if the vertex already exists. If it does exist the vertex is returned, but if it does not exist it is created. The concept of getOrCreate is a common one and having an efficient helper function to perform this task will almost always be necessary.
new File('wiki-Vote.txt').eachLine { – Read the source data file line by line executing the supplied closure for each one.
if (!it.startsWith("#")){ – The file contains comments which are identified by a # at the start of the line. These lines should be ignored.
(fromVertex, toVertex) = it.split('\t').collect(getOrCreate) – Each line consists of a pair of tab delimited values. This code splits the line of text on the tab to create a list containing two userId values. The getOrCreate function is applied over those values with collect and the resulting list is then destructured into two variables containing vertices in the graph that either already existed or were otherwise newly created: fromVertex and toVertex.
fromVertex.addEdge('votesFor', toVertex) – Constructs the edge between the two vertices.
g.commit() – It is worth noting that this load is performed within the context of a single transaction. At the higher end of the 1 million edge range, it will be necessary to perform intermediate commits during that process.

To execute this script, copy it into a file called load-1m.groovy at the root of the Titan install directory. Please note that the script will generate the Titan database on the file system at /tmp/1m. Start Gremlin with bin/gremlin.sh. When the REPL has initialized execute the script as follows:

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> \. load-1m.groovy
==>titangraph[local:/tmp/1m]
==>userId
...
==>null
gremlin> g.V.count()
==>7115
gremlin> g.E.count()
==>103689

The Wikipedia Vote Network has a simple schema. Even at the scale of 1 million edges, the complexity of a batch loading script can only rise from this case. The loading script in this section provides for a good skeleton on which more complicated loads can be fleshed out.

10 Million

Gremlin Ten to the Seventh The approach to loading tens of millions of edges isn’t so different from the previous section. A Gremlin script is still the most direct approach to loading, however there are some differences to consider. The most important of these differences is the use of BatchGraph, which handles intermediate commits of transactions at specified intervals and maintains a vertex cache for fast retrieval. Please refer to the BatchGraph documentation for important information on the limitations of its usage.

The DocGraph data set “shows how healthcare providers team to provide care”. Vertices in this network represent healthcare providers, where they are identified by an NPI number. Edges represent shared interactions between two providers with three properties that further qualify that interaction. The data is partitioned into several sizes based on time windows. This section will utilize the “30-day Window”, which consists of approximately 1 million vertices and 73 million edges.

DocGraph Schema

From Titan’s root directory download and unpackage the DocGraph data set:

$ curl -L -O http://downloads.cms.gov/foia/physician-referrals-2012-2013-days30.zip
$ unzip physician-referrals-2012-2013-days30.zip && rm physician-referrals-2012-2013-days30.zip
$ head -n3 Physician-Referrals-2012-2013-DAYS30.txt

$ sort Physician-Referrals-2012-2013-DAYS30.txt > Physician-Referrals-2012-2013-DAYS30-sorted.txt

Unzipping the archive will create the Physician-Referrals-2012-2013-DAYS30.txt file in the root of Titan directory. Unlike the case in the previous section, the data is pre-sorted by the NPI number of what will be the out vertex for each edge. Pre-sorting the data will help improve the performance of BatchGraph, as writes to and flushes of the cache are reduced. The following Gremlin script demonstrates how to load that file into Titan (backed by BerkleyDB):

conf = new BaseConfiguration() {{
  setProperty("storage.backend", "berkeleyje")
  setProperty("storage.directory", "/tmp/10m")
  setProperty("storage.batch-loading", true)
}}
 
g = TitanFactory.open(conf)
g.makeKey("npi").dataType(String.class).single().unique().indexed(Vertex.class).make()
sharedTransactionCount = g.makeKey("sharedTxCount").dataType(Integer.class).make()
patientTotal = g.makeKey("patientTotal").dataType(Integer.class).make()
sameDayTotal = g.makeKey("sameDayTotal").dataType(Integer.class).make()
g.makeLabel("shares").signature(sharedTransactionCount, patientTotal, sameDayTotal).make()
g.commit()
 
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
bg.setVertexIdKey("npi")
 
c = 0L
new File("Physician-Referrals-2012-2013-DAYS30-sorted.txt").eachLine({ final String line ->
 
    def (id1,
         id2,
         sharedTransactionCount,
         patientTotal,
         sameDayTotal) = line.split(',')*.trim()
 
    def v1 = bg.getVertex(id1) ?: bg.addVertex(id1)
    def v2 = bg.getVertex(id2) ?: bg.addVertex(id2)
    def edge = bg.addEdge(null, v1, v2, "shares")
    edge.setProperty("sharedTxCount", sharedTransactionCount as Integer)
    edge.setProperty("patientTotal", patientTotal as Integer)
    edge.setProperty("sameDayTotal", sameDayTotal as Integer)
 
    if (++c%100000L == 0L) println "Processed ${c} edges"
 
})
 
bg.commit()

The anatomy of this script is as follows (it can be executed in the Gremlin REPL with the instructions supplied in the previous section):

setProperty("storage.batch-loading", true) – Enabling “batch loading” for Titan will help improve performance by disabling consistency checks and locking. Read more about this option and other settings that can affect batch loading in the Titan documentation.
g.makeKey("npi")... – As in the previous example at the 1 million edge scale, types should be created and committed first.
bg = new BatchGraph(g, VertexIDType.STRING, 10000) – Wrap the TitanGraph instance in a BatchGraph, define the data type of the identifier which in this case for the NPI number is a STRING, and set the transaction size to 10000. With this setting, BatchGraph will automatically commit transactions on every 10,000 mutations to the graph.
bg.setVertexIdKey("npi") – Tells BatchGraph that the vertex identifier will be stored in a vertex property key called npi.
...sameDayTotal) = line.split(',')*.trim() – Each line in the file consists of a pair of comma delimited values. This line splits the line of text on the comma to create a list containing five values destructured to five variables.
def v1 = bg.getVertex(id1) ?: bg.addVertex(id1) – BatchGraph helps simplify the getOrCreate function from the previous section. BatchGraph overrides the default functionality of addVertex and getVertex allowing a specification and look-up of a vertex by the NPI number. If the vertex is not found, getVertex will return null and the vertex will have to be added.
bg.commit() – With all loading complete, make a final call to commit to finalize any remaining elements in the transaction buffer.

The DocGraph example demonstrated the key strategies for loading tens of millions of edges, which in summary are: pre-process data when possible to ease loading and improve performance and use BatchGraph to allow focus on the data being loaded as opposed to loading mechanics, such as manually batching commits, handling identifiers, and writing getOrCreate methods.

Some other strategies and ideas to consider at this scale include:

Do programming and testing of load scripts with a subset of the data to improve development cycle time.
Use third-party libraries to to be more productive and reduce the amount of code to be written (e.g. groovycsv).
Consider methods of parallel loading using gpars, if the data is such that it can be organized to allow for that.
If there is an inclination to load data from a non-JVM language, like Python, reconsider this article and write the load script in Gremlin. In this way, the process of loading data can be completed quickly, allowing focus on language-specific tooling (e.g. Bulbs) for Python application development).

The tens of millions size falls into the realm of “Boutique Graph Data“, where many applications will fit or at least be where most applications will start. In that sense, this size is perhaps one of the more relevant sizes in the “Powers of Ten.”

Conclusion

This article explored the lower scales of loading data to Titan. At the millions and tens of millions of edges scales, a Gremlin script and the REPL is generally all that is required for batch-loading activities. For those just getting started with TinkerPop and Titan, taking this approach means having to learn the least amount of the stack in order to get started. Being comfortable at this scale of loading is critical to being successful at the hundreds of millions and billions of edges scales to be described in the next part of this article series.

Acknowledgments

Dr. Vadas Gintautas originally foresaw the need to better document bulk loading strategies and that such strategies seemed to divide themselves nicely in powers of ten.

Authors

Filed under Blog

One Response to Powers of Ten – Part I

Vadas Gintautas says:

May 29, 2014 at 6:12 am

Nice post, and thanks for the nod!

best, V