JanusGraph: Starter Read

Prerequisite:

–          https://sonra.io/2017/06/12/benefits-graph-databases-data-warehousing/

–          [optional] https://www.datastax.com/blog/2013/11/letter-regarding-native-graph-databases

–          There are typically two types of graph system vendors:

  1. OLTP graph databases
  2. OLAP graph processors

JanusGraph: A Graph DB

It is designed to support the processing of graphs so large that they require storage and computational capacities beyond what a single machine can provide. Scaling graph data processing for real-time traversals and analytical queries is JanusGraph’s foundational benefit.

Benefits of JanusGraph:

It supports a wide variety of open-source

  • storage back-ends (Cassandra, HBase, Oracle BerkeleyDB)
    •  Use the CAP theorem to pick the storage back-end: the choice is between consistency (HBase) and availability (Cassandra).
  • analytics engines (Spark’s GraphX, Flink’s Gelly)
  • search engines (Elasticsearch, Solr)

–          Scales well with the number of machines and performs well under concurrent transactions and operational graph processing.

–          Support for geo, numeric range, and full-text search for vertices and edges on very large graphs.

–          Native support for the popular property graph data model exposed by Apache TinkerPop.

  • TinkerPop provides an abstraction over different graph databases and graph processors allowing the same code to be used with different configurable back-ends.

–          Native support for the graph traversal language Gremlin.

  • Gremlin is a functional, data-flow language that enables users to succinctly express complex traversals on (or queries of) their application’s property graph.
  • Gremlin was designed according to the “write once, run anywhere” philosophy.
  • The benefit is that the user does not need to learn both a database query language and a domain-specific BigData analytics language (e.g. Spark DSL, MapReduce, etc.). Gremlin is all that is required to build a graph-based application because the Gremlin traversal machine will handle the rest.

–          Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous supernode problem.

–          Open source under the liberal Apache 2 license.

What should people know when deciding between Neo4j and JanusGraph?

–          Neo4j Community Edition uses the GNU General Public License, which has more restrictive requirements on distributing software. Many developers eventually need the scaling and availability features that are only available in the Neo4j Enterprise Edition, which requires a commercial subscription license.

–          Neo4j is largely self-contained: it implements its own storage engine, indices, server component, network protocol, and query language.

–          Neo4j mostly promotes its own query language, Cypher.

–          TinkerPop is compatible with many other vendors, including Amazon Neptune, Microsoft Azure Cosmos DB, and DataStax Enterprise Graph, although keep in mind that many of the TinkerPop implementations are not free or open source.

–          Source: https://www.ibm.com/cloud/blog/new-builders/database-deep-dives-janusgraph

Graph Visualization:

Recommendations: https://github.com/JanusGraph/janusgraph/wiki/Tools

–          https://github.com/cytoscape/cytoscape.js / https://cytoscape.org/index.html

–          https://github.com/bricaud/graphexp

Limitations of JanusGraph:

Getting Started

Important Tutorial (works only with JanusGraph 0.1):

https://github.com/marcelocf/janusgraph_tutorial



JanusGraph Connection methods:

| Embedded | Remote |
| --- | --- |
| ConfiguredGraphFactory, JanusGraphFactory | Remote client |
| Faster. Supports transactions with control over commit and rollback operations. Avoids network trips because the server runs locally. | Slower. Every statement runs “atomic”. Requires additional network trips. |
| (When the cache is enabled; cache.db-cache = true.) If a JanusGraph server is already running and you start a new JanusGraph server to modify the schema, it fails due to a “locked” keyspace. | Cannot instantiate a server; it connects to an already running server. |
| Supports only Java. | Clients from a variety of languages can access JanusGraph. |
| JanusGraph and the application are tightly bound, which is why this approach is called “embedded”. | You can scale JanusGraph, the backend, and the application independently of one another. |

Examples: https://github.com/JanusGraph/janusgraph/tree/master/janusgraph-examples
Comparison: https://groups.google.com/forum/m/#!topic/janusgraph-users/t7gNBeWC844

JanusGraph Indexing: 

https://docs.janusgraph.org/index-management/index-performance/
There are two major types of indices in JanusGraph:

1. Graph index

a. Composite
     – supports only exact (equality) matches; fast
     – stored in the storage backend (e.g. Cassandra) only; no indexing backend is needed

b. Mixed
    – supports many comparison predicates; slower than Composite
    – requires an indexing backend such as Elasticsearch or Solr


2. Vertex-centric index

  • these are local index structures built individually per vertex
  • they reduce the number of edges fetched for processing
  • they support equality and range/interval constraints
  • they can be created only on property keys whose data type has a native sort order
  • they are stored in the storage backend and do not require an indexing backend such as Elasticsearch or Solr
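
To make the idea of a per-vertex local index concrete, here is a minimal conceptual sketch in plain Java. It is not JanusGraph's storage layout or API: the class VertexCentricIndexSketch, the Edge type, and the "time" sort key are all hypothetical names chosen for illustration. The sketch keeps one vertex's edges sorted by a property, so a range constraint visits only the matching edges instead of every incident edge of a supernode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual model only; this is NOT JanusGraph's storage layout or API.
final class VertexCentricIndexSketch {

    /** Hypothetical edge with one sortable property, "time". */
    static final class Edge {
        final String label;
        final long time;
        final String target;

        Edge(String label, long time, String target) {
            this.label = label;
            this.time = time;
            this.target = target;
        }
    }

    // Local index for ONE vertex: its incident edges sorted by "time".
    private final NavigableMap<Long, List<Edge>> byTime = new TreeMap<>();
    private int edgeCount = 0;

    void addEdge(Edge edge) {
        byTime.computeIfAbsent(edge.time, k -> new ArrayList<>()).add(edge);
        edgeCount++;
    }

    /** Range constraint: only edges with from <= time < to are visited. */
    List<Edge> edgesInRange(long from, long to) {
        List<Edge> hits = new ArrayList<>();
        for (List<Edge> bucket : byTime.subMap(from, true, to, false).values()) {
            hits.addAll(bucket);
        }
        return hits;
    }

    int totalEdges() {
        return edgeCount;
    }

    public static void main(String[] args) {
        // A "supernode" with 100,000 incident edges.
        VertexCentricIndexSketch supernode = new VertexCentricIndexSketch();
        for (long t = 0; t < 100_000; t++) {
            supernode.addEdge(new Edge("follows", t, "v" + t));
        }
        List<Edge> hits = supernode.edgesInRange(10, 20);
        // prints "10 of 100000 edges fetched"
        System.out.println(hits.size() + " of " + supernode.totalEdges() + " edges fetched");
    }
}
```

The sorted map is also why the indexed property needs a sortable data type: without an ordering there is no way to answer a range constraint by visiting only part of the edge list.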

JanusGraph Performance
https://www.experoinc.com/post/have-you-had-your-janusgraph-tuneup

How to avoid duplicate data in Graph?

Florian Hockmann has provided an excellent explanation: https://www.gdatasoftware.com/blog/2018/12/31280-how-to-avoid-doppelgangers-in-a-graph-database

Takeaways:

  1. Perform the get-or-create in a single traversal. Omitting the extra network round trip between the application and the JanusGraph Server shrinks the race-condition time window, but does not close it.
  2. Disable JanusGraph instance caching.
  3. Define the schema with unique constraints on properties and with multiplicities other than MULTI on edge labels.

Get-or-Create Traversal example:

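// Static import needed to resolve the anonymous-traversal steps unfold() and addV()
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*;

// Return the vertex with code "XYZ" if it exists; otherwise create it
g.V().has("code", "XYZ").fold().coalesce(unfold(), addV().property("code", "XYZ")).next();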

Notes:

  • Property keys used on edges and properties have cardinality SINGLE. Attaching multiple values for a single key on an edge or property is not supported.
  • Cassandra Thrift protocol is deprecated and will be removed with JanusGraph 0.5.0. Please switch to the CQL backend.
  • When a vertex is deleted its incident edges are also deleted.
  • drop().iterate() succeeds even if the vertex or edge does not exist.
  • For properties with LIST/SET cardinality:

– To mutate a LIST/SET value, you have to delete it first and add it again.
– The behavior in Titan is slightly different, so my suggestion is to drop() the property first, then add the new items afterward.
– Use separate transactions for the deletion and the addition.
– https://stackoverflow.com/questions/45993742/efficient-way-to-replace-set-list-property-in-janusgraph


Is it preferable to use bytecode or Gremlin scripts?

Bytecode is the recommended way, and Gremlin scripts might go away in a future version. I would consider them a legacy way of sending traversals to the server for execution. Bytecode is a convenient way to serialize a traversal, and it enables Gremlin Language Variants (GLVs), which let you write Gremlin directly in the language of your choice: not only Java/Groovy, but also C#, Python, or JavaScript.


Count Queries:
It is recommended to use “direct index queries” instead of Gremlin count.
https://docs.janusgraph.org/index-backend/direct-index-query/#query-totals


Batch Loading:
https://docs.janusgraph.org/advanced-topics/bulk-loading/

Ways to optimize:
– enable batch loading (storage.batch-loading=true)
– disable consistency checks
– disable external and internal vertex checking
– commit transactions every 10-20k created vertices
– try ScyllaDB
– increase ID allocation (ids.block-size) to 1 million
– use parallel threads with parallel transactions

https://groups.google.com/forum/m/#!msg/janusgraph-users/T7wg_dKri1g/FteDmMIeAgAJ
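
The "commit every 10-20k vertices" advice can be sketched as a small helper. This is a minimal illustration in plain Java, not JanusGraph code: BatchCommitSketch, its writer callback, and its committer callback are hypothetical names. In a real loader the writer would add a vertex inside the current transaction and the committer would commit it and start a new one.

```java
import java.util.function.Consumer;

// Hypothetical helper, not part of JanusGraph: commits after every `batchSize` writes.
final class BatchCommitSketch<T> {
    private final int batchSize;
    private final Consumer<T> writer;   // stands in for "add one vertex in the open transaction"
    private final Runnable committer;   // stands in for "commit, then open the next transaction"
    private int pending = 0;
    private int commits = 0;

    BatchCommitSketch(int batchSize, Consumer<T> writer, Runnable committer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.writer = writer;
        this.committer = committer;
    }

    /** Writes one item and commits once a full batch has accumulated. */
    void write(T item) {
        writer.accept(item);
        pending++;
        if (pending == batchSize) {
            flush();
        }
    }

    /** Commits any pending writes; call once more after the last item. */
    void flush() {
        if (pending == 0) {
            return;
        }
        committer.run();
        commits++;
        pending = 0;
    }

    int commits() {
        return commits;
    }

    public static void main(String[] args) {
        int[] written = {0};
        BatchCommitSketch<Integer> loader =
                new BatchCommitSketch<>(10_000, item -> written[0]++, () -> { });
        for (int i = 0; i < 25_000; i++) {
            loader.write(i);
        }
        loader.flush(); // commit the final partial batch of 5,000
        // prints "25000 writes, 3 commits"
        System.out.println(written[0] + " writes, " + loader.commits() + " commits");
    }
}
```

For the "parallel threads with parallel transactions" item, one such helper per thread, each with its own transaction, would be the natural extension; that part is not shown here.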


Extending my analysis of batch loading from ticket https://jira.intranet.qualys.com/browse/QBAS-1050, and per https://docs.janusgraph.org/advanced-topics/bulk-loading/#configuration-options:

| Mutability | Configuration Place | Config | Default Value | Description |
| --- | --- | --- | --- | --- |
| LOCAL | Can only be provided through a local configuration file | storage.batch-loading | false | Enabling batch loading disables JanusGraph internal consistency checks in a number of places. Most importantly, it disables locking. |
| GLOBAL_OFFLINE | Can only be changed for the entire database cluster at once when all instances are shut down | ids.block-size | 10000 | Set ids.block-size to the number of vertices you expect to add per JanusGraph instance per hour. |
| MASKABLE | Global, but can be overwritten by a local configuration file | storage.buffer-size | 1024 | JanusGraph buffers writes and executes them in small batches to reduce the number of requests against the storage backend. This setting controls the size of these batches. |

Notes on batch loading:

  1. For an embedded connection (JanusGraphFactory):
    JanusGraphTransaction tx = graph.buildTransaction().enableBatchLoading().start();
  2. For a remote connection:
    Set storage.batch-loading=true in the “janusgraph-cql-es-server.properties” file that is used to start the Gremlin Server.
    Any change requires a restart of the JanusGraph instance to take effect.
  3. Batch loading disables all data integrity checks defined in the schema.
  4. Indexing is not impacted; however, to optimize writes it is suggested to increase Elasticsearch’s refresh interval. https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_unset_or_increase_the_refresh_interval
  5. Benchmarks:
    VM Config:
    OS: CentOS 7 (64 bit)
    RAM: 8 GB
    Java: OpenJDK version “1.8.0_232”
    Disk: SSD

    Tests 2, 3, and 5 enable batch loading with these additional settings:
    storage.batch-loading=true
    ids.block-size=1000000
    storage.buffer-size=4096

    | Test | JanusGraphFactory (embedded) | Remote |
    | --- | --- | --- |
    | 1. Without batch loading: create 1 million nodes with UUID. | 204 seconds | Very slow: more than 5 mins (had to stop the test) |
    | 2. Batch loading enabled; commit one transaction at the end. | 97 seconds | Transactions are not supported. More than 5 mins (had to stop the test) |
    | 3. Batch loading enabled; commit a transaction after every 10K records. | 77 seconds | Transactions are not supported. More than 5 mins (had to stop the test) |
    | 4. Without batch loading: load 20K nodes at a rate of 2K requests per second. | 70 seconds | 300 seconds |
    | 5. Batch loading enabled; load 20K nodes at a rate of 2K requests per second. | 63 seconds | 284 seconds |

    Observation:
    – An embedded connection is much faster than a remote one.

  6. After inserting 1 million nodes, a count query times out:
    gremlin> g.V().count();
    Evaluation exceeded the configured 'evaluationTimeout' threshold of 30000 ms or evaluation was otherwise cancelled directly for request [g.V().count();] - try increasing the timeout with the :remote command
    Type ':help' or ':h' for help.
    Display stack trace? [yN]
  7. Collecting performance metrics: https://docs.janusgraph.org/advanced-topics/monitoring/
    There are two ways to enable them:
    a. In the configuration: metrics.enabled = true
    b. Through the management API: management.set("metrics.enabled", true);
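
To put the embedded benchmark timings from note 5 in perspective, the arithmetic below converts them to throughput and speed-up. It is only a calculation sketch: ThroughputSketch is a hypothetical class, and it assumes that tests 2 and 3 load the same 1 million nodes as test 1, which the benchmark notes imply but do not state.

```java
import java.util.Locale;

// Arithmetic helper only; the timings are the embedded results reported in note 5.
final class ThroughputSketch {

    /** Vertices written per second. */
    static double perSecond(long vertices, double seconds) {
        return vertices / seconds;
    }

    /** How many times faster the tuned run is than the baseline run. */
    static double speedup(double baselineSeconds, double tunedSeconds) {
        return baselineSeconds / tunedSeconds;
    }

    public static void main(String[] args) {
        long vertices = 1_000_000L; // assumption: tests 2 and 3 also load 1 million nodes
        String[] tests = {
            "1. no batch loading",
            "2. batch loading, single commit",
            "3. batch loading, commit every 10K"
        };
        double[] seconds = {204, 97, 77};

        for (int i = 0; i < tests.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "%s: %d vertices/s, %.2fx vs. test 1",
                    tests[i],
                    Math.round(perSecond(vertices, seconds[i])),
                    speedup(seconds[0], seconds[i])));
        }
    }
}
```

Under that assumption the three runs come out at roughly 4,900, 10,300, and 13,000 vertices per second, so batch loading with periodic commits is about 2.65 times faster than the untuned embedded run.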

Mixed index

As per the docs https://docs.janusgraph.org/index-management/index-performance/#graph-index https://docs.janusgraph.org/index-management/index-performance/#mixed-index

  • Composite indexes are very fast and efficient but limited to equality
  • Mixed indexes provide more flexibility than composite indexes and support additional condition predicates beyond equality. Mixed indexes are slower for most equality queries than composite indexes

Search Conditions supported JG: https://docs.janusgraph.org/index-backend/search-predicates/

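Contrary to the documentation:

  1. I could run all types of queries (https://docs.janusgraph.org/index-backend/search-predicates/#query-examples) without a Mixed index.
  2. Non-equality operations such as “lt, gte, range, regex, textContains” worked with only a Composite index.

A likely explanation (not verified in these tests): when no index covers a predicate, JanusGraph still answers the query by scanning and filtering the elements itself, so the query works but is not index-accelerated. Setting query.force-index=true makes such queries fail instead of falling back to a scan.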

To create a Mixed index:

The definition refers to the indexing backend name “search” so that JanusGraph knows which configured indexing backend it should use for this particular index. The “search” parameter specified in the buildMixedIndex call must match the second clause in the JanusGraph configuration definition like this: “index.search.backend”

  • In the code, set the following:

mixedIndexConfigName = "search";  // must match the second clause of the JanusGraph config key, e.g. "index.search.backend"

s.append("management.buildIndex(\"vAge\", Vertex.class).addKey(age).buildMixedIndex(\"" + mixedIndexConfigName + "\"); ");
s.append("management.buildIndex(\"eReasonPlace\", Edge.class).addKey(reason).addKey(place).buildMixedIndex(\"" + mixedIndexConfigName + "\"); ");
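
Because a mismatch between the name passed to buildMixedIndex and the configured index name is easy to miss, a small pre-flight check can help. The sketch below is plain Java and not a JanusGraph API: IndexBackendNames and both of its methods are hypothetical. It only assumes that the configuration is available as a java.util.Properties object and that index backends are declared with keys of the form index.NAME.backend, as described above.

```java
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-flight check; not part of JanusGraph.
final class IndexBackendNames {

    /** Collects NAME from every key shaped like "index.NAME.backend". */
    static Set<String> configuredIndexNames(Properties config) {
        Set<String> names = new TreeSet<>();
        for (String key : config.stringPropertyNames()) {
            String[] parts = key.split("\\.");
            if (parts.length == 3 && parts[0].equals("index") && parts[2].equals("backend")) {
                names.add(parts[1]);
            }
        }
        return names;
    }

    /** Fails fast if the name used for buildMixedIndex is not configured. */
    static void requireConfigured(Properties config, String mixedIndexConfigName) {
        Set<String> names = configuredIndexNames(config);
        if (!names.contains(mixedIndexConfigName)) {
            throw new IllegalArgumentException(
                    "No index." + mixedIndexConfigName + ".backend entry; configured names: " + names);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("index.search.backend", "elasticsearch");
        config.setProperty("index.search.hostname", "127.0.0.1");

        requireConfigured(config, "search"); // passes: matches index.search.backend
        // prints "[search]"
        System.out.println(configuredIndexNames(config));
    }
}
```

Calling requireConfigured with a misspelled name such as "serach" throws before any schema change is attempted.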

Todo:

  1. Custom DataType (UDT) https://docs.janusgraph.org/basics/common-questions/#custom-class-datatype / https://docs.janusgraph.org/advanced-topics/serializer/
  2. OLAP https://docs.janusgraph.org/advanced-topics/hadoop/
  3. ConfiguredGraphFactory https://docs.janusgraph.org/basics/configured-graph-factory/#overview