JanusGraph – Distributed, open source, scalable graph database

mrdoops · on July 7, 2021

A lot of comments not sure about what Graph DBs are good for:

* Flexible knowledge association i.e. Knowledge Graphing

* Modeling and querying associations / models with many-steps-removed requirements

* Expert Systems / Inference Engines

* Lazy traversal for complex job scheduling

Graph DBs are not good at being a general purpose 95% of use cases database. Just use Postgres/MySQL if you're not sure. We use Neptune (AWS managed GraphDB) to model cybersecurity dependencies between many companies and report on supply chain vulnerabilities many steps removed. Those kinds of queries are non-trivial and expensive on anything but a Graph Database.

As GraphDBs meet niche query requirements you usually have other databases involved in the full application. If you want to tractably manage many databases in a system you ideally want to be in streaming / event sourced semantics. If you're already in an imperative crud-around-data / batch pipeline you'll find greater maintenance costs in adopting a GraphDB or any additional DB for that matter.

jshen · on July 7, 2021

I have yet to see streaming/event sourcing work well. Every time I’ve seen it used it’s caused more problems than it’s resolved. The main problems, out of order events and/or slow to propagate events.

mrdoops · on July 7, 2021

From a technical perspective there are plenty of ways to serialize / order some stream of events in a reasonable way. Whether that's implementing your event store on top of a transactionally secure database (e.g. Postgres, not Mongo) or using a higher throughput, less persistently secure solution like Kafka. There's also ways to deal with eventual consistency and distributed transactions (SAGAs) depending on the need.

The hard part is that ES/Streaming systems work best / almost necessitate a clean and clear domain model. A clean and clear domain model requires a lot of discussion and consensus with domain experts and product owners. Buy in to have the kind of discussions needed is the source of the issues I've experienced with these kind of systems. CRUD can paint over a lot of cloudy abstract concepts for better or for worse. These kind of discussions are energy intensive and mentally painful to cast light on the cloudy thoughts.

There's not great streaming/es support on a language level outside of the robust actor model systems (e.g. Erlang/Elixir). There are systems like Akka that simulate that to some extent on runtimes like JVM, but a cooperative scheduler and an actor model don't mix great. For non-actor model aspects I've been seeing more service level dataflow systems like KSQL / MaterializeDB gain traction, but are nevertheless a solution for read-models not application logic.

jshen · on July 7, 2021

In short, making streaming architectures work require great central planing, a grand architect in the sky, or they fail miserably. I think that is an argument against streaming architectures ;)

mrdoops · on July 7, 2021

I think that's a bit of a reduction/straw man to say they need a lot of architecture/central-planning. An individual developer can do it so long as they have access to domain experts to ask the right questions. Lack of proper planning will result in any architecture failing miserably - best not to code until we know what to code.

The implementation overhead as far as code-to-write for streaming architectures is comparable to CRUD. But there is less knowledge dispersed about practices on how to do it, so there is the cost of learning. It is more cutting edge after all.

jshen · on July 7, 2021

It is admittedly an oversimplification for affect, but it’s more or less what you said. Maybe I’ve worked at bigger companies, we have over 1000 devs in my division, and trying to get them all aligned is very very difficult. I think central planning is a good analogy for that. The architecture that requires less central planning/coordination will likely do better in such a dynamic.

mrdoops · on July 7, 2021

I expect streaming will be adopted like anything - first in smaller more mobile teams working in greenfield contexts, then later in large organizations when the idea is less novel and lower risk/cost to apply at scale.

Regardless I feel streaming has been around long enough now it should be urgently considered anywhere where the words "data pipeline" and "scale" are thrown around together frequently.

jshen · on July 7, 2021

I’ve implemented it several times, and it hasn’t been clearly better. Don’t get me wrong, the alternatives haven’t been great either. I think it’s just hard when you have large organizations, large number of systems some of which are decades old, and you have high volumes of interactions/transactions.

fulafel · on July 8, 2021

There are many other applications that work well with graph / triple store style approach to data. See eg Datomic and the several Datomic inspired databases (like Crux, Datahike, Datalevin, etc) used as general work horses in the Clojure world.

I'd recommend trying non relational databases even if you're not sure.

freewilly1040 · on July 7, 2021

Graph databases have been the great white whale at my org for a number of years. We gave a crack at Janus a while back. It (like a few attempts at Neo4J) failed to deliver on the promise of unlocking queries with more than a hop or two, while dramatically underperforming on those one or two hop queries vs a graph implemented in MySQL.

im_down_w_otp · on July 7, 2021

We kind of ran into a similar problem. Our core data (multi-context execution/event traces) is fundamentally graph shaped, but it's also among the pathologically poor graph structures for most general purpose graph databases and their associated query DSLs and execution planners/optimizers, so we had to build a solution that was tailored more for our domain (verifying system properties of complex interacting components).

Which I suppose kind of typifies the problem. Graph databases are fantastic because they let you flexibly and coherently model practically anything. But, perhaps principally because of this, they can become an impediment once you better understand the nuances and idiosyncrasies of your domain, and thus need something that has more optimal (or perhaps predictable) performance for the kinds of questions you know you need to ask over a representation of your domain/data that you know is sufficient?

tschellenbach · on July 7, 2021

I also don't see many valid use cases for graph databases.

freewilly1040 · on July 7, 2021

Marketing and fraud detection are pretty valid imo. Just inherently hard to scale.

I do think there's a valid question of how useful n-hop queries are for an N that is greater than 2 or 3.

alexott · on July 7, 2021

There is a lot - customer support (aka customer 360), fraud detection, some maintenance cases, inventory, recommendations, etc. But it’s heavily dependent on requirements - what should be response time. Most of graph databases are good in fast response time, but will do it only for 2-3 hops from known start points. For many other things, graph analytics with Spark or something like could be better

pphysch · on July 7, 2021

2-3 hops (JOINs) is well within RDBMS territory, no?

freewilly1040 · on July 8, 2021

Yes. But it’s not a very exciting solution so people keep trying graph db’s.

jshen · on July 7, 2021

They are great for data exploration, data science, analytics, etc. I would NOT put one as a dependency on a user experience though.

tejtm · on July 7, 2021

Have you tried getting grants promising they will deliver? </s>

rektide · on July 7, 2021

> I also don't see many valid use cases for graph databases.

It's the most general purpose means I can see to model entities. I can't see many invalid uses.

rektide · on July 7, 2021

There were a couple years where it was mostly abandoned. Good to see this solid graph database being well maintained. the milestones[1] mostly show a lot of upgrading libraries, some enhancements/features sprinkled in, but for a while Janus was nearly abandoned.

Maintenance re-started in 2017, with IBM & Google stepping up to back it[2].

[1] https://github.com/JanusGraph/janusgraph/milestones?state=cl...

[2] https://architecht.io/google-ibm-back-new-open-source-graph-...

speedgoose · on July 7, 2021

Sorry to ask about it, but while deciding the name of your Java graph database, how did you ended up with anus?

NewJazz · on July 7, 2021

https://en.m.wikipedia.org/wiki/Janus

speedgoose · on July 7, 2021

Yes but a lot of Java projects start with a J and they must have thought about it.

Graphguy · on July 7, 2021

Detailed read from two contributors on the project- https://www.ibm.com/cloud/blog/database-deep-dives-janusgrap...

antpls · on July 10, 2021

To me, the downside of graph db is the non-standardized query language.

I tried Gremlin but it feels like an imperative DSL. Cypher queries are more readable, but are limited to Neo4J. I am looking forward for Open Cypher or maybe a variation of Facebook GraphQL.

axiosgunnar · on July 8, 2021

Would it make sense to use a graph DB to model file hierarchies (folders and subfolders) at scale?

(hundreds of thousands of folders)

The idea would be to use a graph DB for a first query to get the file ids in scope (all files inside a given folder and its subfolders) before running the actual SQL query, eg creation_date < foo AND file_id in [array from graph db output]

AtNightWeCode · on July 7, 2021

[flagged]

amitport · on July 7, 2021

Elaborate?

AtNightWeCode · on July 7, 2021

[flagged]

bryanrasmussen · on July 7, 2021

>Graph databases cause expensive writes and cheap reads. Something anybody typically never ever needs.

I find this statement really surprising, just about every application not dealing with money I've ever been on has had lots more reads than writes and would thereby benefit if the reads were cheap - obviously nobody wants expensive writes but if the benefit is cheap reads and the expensiveness of writes can be dealt with by batching etc. I guess it's an acceptable tradeoff.

staticassertion · on July 7, 2021

First of all, "expensive writes and cheap reads" is the common case. Second, nothing about graph databases implies expensive writes and cheap reads.

AtNightWeCode · on July 7, 2021

This is the ONLY reason to use a graph databases.... Go a head and down vote instead of learn something. :)

staticassertion · on July 7, 2021

That is not the only reason to use a graph database, and you're not exactly making a compelling argument, or a coherent one. I doubt I have much to learn from you on the subject.

boxed · on July 7, 2021

That's a bit much. I have a hobby project that is the graph of all taxons. It is not write heavy, to say the least. (I use mysql for it but still :))

AtNightWeCode · on July 7, 2021

Pretty much common knowledge...

Doc db -> very cheap reads / very cheap writes

Sgl db -> cheap reads / expensive writes

Graph db -> cheap reads -> very expensive writes

...for the type of data.

spinningslate · on July 7, 2021

If it were as simple as that, then surely the whole world would be using document DBs exclusively.

The whole world isn't doing that. So maybe there's more to it.

I'm doing some work right now where a graph is a good conceptual fit to the problem space. Writes are much less common than reads. A graph-theoretic approach is a good fit for the queries it needs to support; transitive closure and topological sort for example.

Would I use a graph DB for a heavily transactional system like banking? No. Different problem, different requirements, different tech choice.

But you seem to be suggesting that there are no problem scenarios where the characteristics of a graph DB are a good fit. That seems naive at best.

AtNightWeCode · on July 7, 2021

[flagged]

staticassertion · on July 7, 2021

This is practically parody...

I'm the founder of a company building a graph product. I've talked to numerous researchers in the field, read papers on the subject, I've even reviewed PHD candidates thesis for universities. I routinely field offers for consulting explicitly on graph database technology.

You haven't said anything of substance and the little you have said has only served to convey your own ignorance on the subject. I've flagged a few of your posts since they break HN rules.

AtNightWeCode · on July 8, 2021

I worked with graph databases. I also worked with graph like queries in SQL databases a lot for investigating money laundering, flow of payments between users and so on. In single systems that handles billions of Euros. This is easily done in SQL databases and has been since the eighties.

In general the only reason to use a graph database as I wrote is when you want to trade expensive writes for cheap ad hoc graph queries. So if you are investigating international money laundering with a lot of different types of relationships between different types of entities and it is significant how the entities are related then SQL can be a poor fit. Or at least, you probably need to model a lot of the relationships as entities and keep on evolving the model with new data. With graph databases you could just enter all the data from different sources as nodes and edges and then perform any query on that data still with good performance. At least in theory. In practice it can be difficult to model graph databases too.

The parody is that people suddenly thinks that there is a general need for graph databases.

spinningslate · on July 7, 2021

Please refrain from personal attacks, it doesn't contribute to the discussion. And it's not consistent with the community guidelines [0].

> ALL databases have tradeoffs

That was exactly my point.

[0] https://news.ycombinator.com/newsguidelines.html

arthurcolle · on July 7, 2021

How large were your datasets?