Deploying the Aurelius Graph Cluster
October 17, 2012
The Aurelius Graph Cluster is a cluster of interoperable graph technologies that can be deployed on a multi-machine compute cluster. This post demonstrates how to set up the cluster on Amazon EC2 (a popular cloud service provider) with the following graph technologies:
Titan is an Apache2-licensed distributed graph database that leverages existing persistence technologies such as Apache HBase and Cassandra. Titan implements the Blueprints graph API and therefore supports the Gremlin graph traversal/query language. [OLTP]
Faunus is an Apache2-licensed graph computing framework for batch analytics based on Apache Hadoop. Faunus leverages the Blueprints graph API and exposes Gremlin as its traversal/query language. [OLAP]
Please note the date of this publication. Newer versions of the technologies discussed, as well as other deployment techniques, may exist. Finally, all commands point to an example cluster; adapt them to the specific cluster being computed on.
Cluster Configuration
The examples in this post assume the reader has access to an Amazon EC2 account. The first step is to create a machine instance that has, at minimum, Java 1.6+ installed. This instance is used to spawn the graph cluster. The name given to this instance is agc-master and it is a modest m1.small machine. On agc-master, Apache Whirr 0.8.0 is downloaded and unpacked.
~$ ssh ubuntu@ec2-184-72-209-80.compute-1.amazonaws.com
...
ubuntu@ip-10-117-55-34:~$ wget http://www.apache.org/dist/whirr/whirr-0.8.0/whirr-0.8.0.tar.gz
ubuntu@ip-10-117-55-34:~$ tar -xzf whirr-0.8.0.tar.gz
Whirr is a cloud service agnostic tool that simplifies the creation and destruction of a compute cluster. A Whirr “recipe” (i.e. a properties file) describes the machines in a cluster and their respective services. The recipe used in this post is provided below and saved to a text file named agc.properties on agc-master. The recipe defines a 5 m1.large machine cluster running HBase 0.94.1 and Hadoop 1.0.3 (see whirr.instance-templates). HBase will serve as the database persistence engine for Titan and Hadoop will serve as the batch computing engine for Faunus.
whirr.cluster-name=agc
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,4 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
whirr.hbase.tarball.url=http://archive.apache.org/dist/hbase/hbase-0.94.1/hbase-0.94.1.tar.gz
whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
hbase-site.dfs.replication=2
From agc-master, the following commands launch the previously described cluster. Note that the first two lines require account-specific Amazon EC2 credentials. When the launch completes, the Amazon EC2 web admin console will show the 5 m1.large machines.
ubuntu@ip-10-117-55-34:~$ export AWS_ACCESS_KEY_ID=     # requires account specific information
ubuntu@ip-10-117-55-34:~$ export AWS_SECRET_ACCESS_KEY= # requires account specific information
ubuntu@ip-10-117-55-34:~$ ssh-keygen -t rsa -P ''
ubuntu@ip-10-117-55-34:~$ whirr-0.8.0/bin/whirr launch-cluster --config agc.properties
In the deployed cluster, each machine maintains its respective software services. The sections to follow demonstrate how to load and then process graph data within the cluster. Titan will serve as the data source for Faunus’ batch analytic jobs.
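When experimentation is complete, Whirr can also tear the cluster back down with its destroy-cluster command. The sketch below is guarded behind a flag so it is a no-op unless explicitly enabled, since it assumes live AWS credentials and the agc.properties recipe above:

```shell
# tear down the agc cluster when finished; set RUN_WHIRR=1 to enable the real call
if [ "${RUN_WHIRR:-0}" = "1" ]; then
  whirr-0.8.0/bin/whirr destroy-cluster --config agc.properties
else
  echo "dry run: set RUN_WHIRR=1 to destroy the cluster"
fi
```

Destroying the cluster terminates all 5 m1.large instances, so be sure any results stored in HDFS have been copied off first.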
Loading Graph Data into Titan
Titan is a highly scalable, distributed graph database that leverages existing persistence engines. Titan 0.1.0 supports Apache Cassandra (AP), Apache HBase (CP), and Oracle BerkeleyDB (CA); each of these backends emphasizes a different aspect of the CAP theorem. For the purpose of this post, Apache HBase is utilized and therefore, Titan is consistent (C) and partition tolerant (P).
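For illustration, swapping the storage backend is a configuration-only change. The fragment below is a sketch, not run in this post: it points Titan at a Cassandra cluster instead of HBase, assuming the Cassandra backend follows the same BaseConfiguration property pattern used for the HBase connection later in this post; the hostname is hypothetical.

```groovy
// hypothetical sketch: open Titan against Cassandra instead of HBase
conf = new BaseConfiguration()
conf.setProperty('storage.backend','cassandra')    // 'hbase' in the rest of this post
conf.setProperty('storage.hostname','10.40.23.46') // a Cassandra seed node (illustrative)
g = TitanFactory.open(conf)
```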
For the sake of simplicity, the 1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master machine will be used for cluster interactions. Its IP address can be found in the Whirr instance metadata on agc-master. The reason for using this machine is that numerous services are already installed on it (e.g. the HBase shell, Hadoop, etc.) and therefore, no manual software installation is required on agc-master.
ubuntu@ip-10-117-55-34:~$ more .whirr/agc/instances
us-east-1/i-3c121b41 zookeeper,hadoop-namenode,hadoop-jobtracker,hbase-master 54.242.14.83 10.12.27.208
us-east-1/i-34121b49 hadoop-datanode,hadoop-tasktracker,hbase-regionserver 184.73.57.182 10.40.23.46
us-east-1/i-38121b45 hadoop-datanode,hadoop-tasktracker,hbase-regionserver 54.242.151.125 10.12.119.135
us-east-1/i-3a121b47 hadoop-datanode,hadoop-tasktracker,hbase-regionserver 184.73.145.69 10.35.63.206
us-east-1/i-3e121b43 hadoop-datanode,hadoop-tasktracker,hbase-regionserver 50.16.174.157 10.224.3.16
Once logged into the machine via ssh, Titan 0.1.0 is downloaded, unzipped, and the Gremlin console is started.
ubuntu@ip-10-117-55-34:~$ ssh 54.242.14.83
...
ubuntu@ip-10-12-27-208:~$ wget https://github.com/downloads/thinkaurelius/titan/titan-0.1.0.zip
ubuntu@ip-10-12-27-208:~$ sudo apt-get install unzip
ubuntu@ip-10-12-27-208:~$ unzip titan-0.1.0.zip
ubuntu@ip-10-12-27-208:~$ cd titan-0.1.0/
ubuntu@ip-10-12-27-208:~/titan-0.1.0$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin>
A toy 1 million vertex/edge graph is loaded into Titan using the Gremlin/Groovy script below (simply cut-and-paste the source into the Gremlin console and wait approximately 3 minutes). The code implements a preferential attachment algorithm. For an explanation of this algorithm, please see the second column of page 33 in Mark Newman’s article The Structure and Function of Complex Networks.
// connect Titan to HBase in batch loading mode
conf = new BaseConfiguration()
conf.setProperty('storage.backend','hbase')
conf.setProperty('storage.hostname','localhost')
conf.setProperty('storage.batch-loading','true');
g = TitanFactory.open(conf)

// preferentially attach a growing vertex set
size = 1000000; ids = [g.addVertex().id]; rand = new Random();
(1..size).each{
  v = g.addVertex();
  u = g.v(ids.get(rand.nextInt(ids.size())))
  g.addEdge(v,u,'linked');
  ids.add(u.id); ids.add(v.id);
  if (it % 10000 == 0) {
    g.stopTransaction(SUCCESS)
    println it
  }
};
g.shutdown()
Batch Analytics with Faunus
Faunus is a Hadoop-based graph computing framework. It supports performant global graph analyses by making use of sequential reads from disk (see The Pathologies of Big Data). Faunus provides connectivity to Titan/HBase, Titan/Cassandra, any Rexster-fronted graph database, and to text/binary files stored in HDFS. From the 1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master machine, Faunus 0.1-alpha is downloaded and unzipped. The provided titan-hbase.properties file should be updated with hbase.zookeeper.quorum=10.12.27.208 instead of localhost. The IP address 10.12.27.208 is provided by ~/.whirr/agc/instances on agc-master. Finally, the Gremlin console is started.
ubuntu@ip-10-12-27-208:~$ wget https://github.com/downloads/thinkaurelius/faunus/faunus-0.1-alpha.zip
ubuntu@ip-10-12-27-208:~$ unzip faunus-0.1-alpha.zip
ubuntu@ip-10-12-27-208:~$ cd faunus-0.1-alpha/
ubuntu@ip-10-12-27-208:~/faunus-0.1-alpha$ vi bin/titan-hbase.properties
ubuntu@ip-10-12-27-208:~/faunus-0.1-alpha$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin>
A few example Faunus jobs are provided below. The final job generates an in-degree distribution: the in-degree of a vertex is the number of incoming edges to that vertex. The output states how many vertices (second column) have a particular in-degree (first column). For example, 167,050 vertices have only 1 incoming edge.
gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')
==>faunusgraph[titanhbaseinputformat]
gremlin> g.V.count() // how many vertices in the graph?
==>1000001
gremlin> g.E.count() // how many edges in the graph?
==>1000000
gremlin> g.V.out.out.out.count() // how many length 3 paths are in the graph?
==>988780
gremlin> g.V.sideEffect('{it.degree = it.inE.count()}').degree.groupCount // what is the graph's in-degree distribution?
==>1 167050
==>10 2305
==>100 6
==>108 3
==>119 3
==>122 3
==>133 1
==>144 2
==>155 1
==>166 2
==>18 471
==>188 1
==>21 306
==>232 1
==>254 1
==>...
gremlin>
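For readers without a running cluster, the grouping performed by the final job can be sketched with standard Unix tools over a plain edge list (the file name and its contents below are made up for illustration): count how often each vertex appears as an edge destination (its in-degree), then count how many vertices share each in-degree.

```shell
# hypothetical edge list: one "source<TAB>destination" pair per line
printf '1\t2\n3\t2\n4\t2\n2\t5\n' > /tmp/edges.txt
# column 2 holds destinations; the first uniq -c yields per-vertex in-degree,
# the second uniq -c yields the distribution (in-degree, frequency)
cut -f2 /tmp/edges.txt | sort | uniq -c | awk '{print $1}' \
  | sort -n | uniq -c | awk '{print $2, $1}'
```

In this toy input, vertex 2 has in-degree 3 and vertex 5 has in-degree 1, so the sketch prints "1 1" and "3 1" in the same (degree, frequency) shape as the Faunus output above.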
To conclude, the in-degree distribution result is pulled from Hadoop’s HDFS (stored in output/job-0). Next, scp is used to download the file to agc-master and then again to download the file to a local machine (e.g. a laptop). If the local machine has R installed, the file can be plotted and visualized (see the final diagram below). The log-log plot demonstrates the known result that the preferential attachment algorithm generates a graph with a power-law degree distribution (i.e. “natural statistics”).
ubuntu@ip-10-12-27-208:~$ hadoop fs -getmerge output/job-0 distribution.txt
ubuntu@ip-10-12-27-208:~$ head -n5 distribution.txt
1	167050
10	2305
100	6
108	3
119	3
ubuntu@ip-10-12-27-208:~$ exit
...
ubuntu@ip-10-117-55-34:~$ scp 54.242.14.83:~/distribution.txt .
ubuntu@ip-10-117-55-34:~$ exit
...
~$ scp ubuntu@ec2-184-72-209-80.compute-1.amazonaws.com:~/distribution.txt .
~$ R
> t = read.table('distribution.txt')
> plot(t,log='xy',xlab='in-degree',ylab='frequency')
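As a quick sanity check on a downloaded distribution file, the second-column frequencies should sum to the total number of vertices that have at least one incoming edge. The sketch below runs against a stand-in file (the real distribution.txt lives on the cluster; the three rows shown are the first rows from the output above):

```shell
# stand-in for distribution.txt: "in-degree<TAB>frequency" rows
printf '1\t167050\n10\t2305\n100\t6\n' > /tmp/distribution.txt
# sum the frequency column
awk '{s += $2} END {print s}' /tmp/distribution.txt
```

For these three stand-in rows the sum is 169361 (167050 + 2305 + 6).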
Conclusion
The Aurelius Graph Cluster is used for processing massive-scale graphs, where massive-scale denotes a graph so large it does not fit within the resource confines of a single machine. In other words, the Aurelius Graph Cluster is all about Big Graph Data. The two cluster technologies explored in this post were Titan and Faunus. They serve two distinct graph computing needs. Titan supports thousands of concurrent real-time, topologically local graph interactions. Faunus, on the other hand, supports long running, topologically global graph analyses. In other words, they provide OLTP and OLAP functionality, respectively.
References
London, G., “Set Up a Hadoop/HBase Cluster on EC2 in (About) an Hour,” Cloudera Developer Center, October 2012.
Newman, M., “The Structure and Function of Complex Networks,” SIAM Review, volume 45, pages 167-256, 2003.
Jacobs, A., “The Pathologies of Big Data,” ACM Queue, volume 7, number 6, July 2009.