Worked on GemFire prior to Pivotal acquisition (before Geode) and currently work for SnappyData.
You can imagine GemFire/GridGain as an apples-to-apples comparison. Both are "enterprise" in-memory data grids originally intended for managing data in low-latency OLTP applications, which later added analytics/OLAP features. Geode/Ignite are the open source options for these two IMDGs and also a good apples-to-apples comparison. (Hazelcast also has enterprise and OSS versions I would compare accordingly.)
I can't speak to the current comparison between these systems, but I can compare them to SnappyData. SnappyData deeply integrates GemFire with Spark to bring high concurrency, high availability, and mutability to Spark applications. In the world of combining Spark with a datastore over a connector (Cassandra, Hive, MySQL, Mongo, etc.) to enable "database-like" features in Spark, SnappyData has taken the next step of integration. In Snappy, the database (GemFire) and the Spark executors share the same block manager and JVM, so the systems no longer communicate over a "connector." This, along with our database optimizations, provides the best performance for Spark applications in what I like to call the "Spark Database Ecosystem."
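To make the "connector" pattern concrete, here is a rough sketch of what connector-based access looks like from Spark, using the DataStax spark-cassandra-connector as the example (the keyspace and table names are made up for illustration):

```scala
// Sketch of the "connector" pattern described above: Spark reading
// from an external store through a DataSource connector. Assumes the
// spark-cassandra-connector is on the classpath; "shop"/"orders" are
// hypothetical names.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "orders"))
  .load()

df.filter("total > 100").show()
```

Every query here ships rows over the network from the store's processes into Spark's executors. An embedded store that shares the executor JVM and block manager avoids that hop entirely, which is the design point being made above.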
As such, comparing SnappyData to GemFire/Hazelcast/Gridgain does not make much sense unless you are trying to use Spark in conjunction with these systems. In that case, the main difference I would point out is that SnappyData will necessarily perform better as any of them would need to use a connector to interact with Spark. The better comparison would be between SnappyData and Ignite, as Ignite contains a direct Spark abstraction called "IgniteRDD." That said, the majority of the comparisons/benchmarks we've run have been against MemSQL+Spark and Cassandra+Spark, so I don't have much to say about Ignite vs SnappyData.
User manigandham mentions SnappyData's Approximate Query Processing feature (called the Synopses Data Engine), which is unique within this space, but a discussion of it would take us too far afield.
SnappyData employee here -- This is essentially what we did. The main difference is that we already had a decade-old transactional K/V store that, over time, morphed into a more full-fledged in-memory database. That is what we integrated with Spark, versus rolling a new database. The SQL layer in this database (GemFire/Geode) already had a number of optimizations we could use to speed up Spark SQL queries, even over the native Spark cache.
Like some of the other comments in this thread, the idea was to provide all the guarantees of an OLTP store (HA, ACID, scalability, mutations, etc.) with the powerful analytic capabilities of Spark.
Appreciate these comments; the site did not go through much testing before being deployed. The overflow behavior was modified to eliminate horizontal scroll on mobile, but it looks like there were some vertical issues as well. We will get this fixed.
Our impression was that when Databricks released the billion-rows-in-one-second-on-a-laptop benchmark, readers were pretty awed by that result. We wanted to show that when you combine an in-memory database with Spark so it shares the same JVM/block manager, you can squeeze even more performance out of Spark workloads (over and above Spark's internal columnar storage). Any analytics that require multiple trips to a database will be impacted by this design. E.g., workloads on a Spark + Cassandra analytics cluster will be significantly slower, barring some fundamental changes to Cassandra.
A web design QA note for all: thin fonts (e.g., 300-400 weight) used as a body font may look fine on macOS due to its better font rendering, but they do not work well on Windows.
It takes about as long as it takes to start up a Spark cluster, and you can interact with it entirely through Spark APIs if that's what you're comfortable with. It can also be used in "Split Cluster Mode" so you can use your existing Spark build instead of what's embedded within SnappyData if you prefer.
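As an illustration of "interact with it entirely through Spark APIs," here is a rough sketch based on my reading of SnappyData's docs, where a SnappySession wraps an ordinary SparkContext; treat the names (app name, table, columns) as hypothetical rather than canonical usage:

```scala
// Sketch: driving SnappyData through ordinary Spark APIs.
// Assumes SnappyData's Spark extension on the classpath, which exposes
// SnappySession as a drop-in analogue of SparkSession. Table and
// column names below are made up.
import org.apache.spark.sql.{SparkSession, SnappySession}

object SnappySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("snappy-sketch")
      .master("local[*]")
      .getOrCreate()

    // SnappySession wraps the same SparkContext, so existing Spark
    // code and SnappyData tables run in one cluster and one JVM.
    val snappy = new SnappySession(spark.sparkContext)

    // Create a mutable column table and query it with plain Spark SQL.
    snappy.sql("CREATE TABLE quotes (symbol STRING, price DOUBLE) USING column")
    snappy.sql("INSERT INTO quotes VALUES ('ACME', 42.0)")
    snappy.table("quotes").show()
  }
}
```

The point of the sketch is that nothing here departs from the familiar SparkSession/DataFrame programming model; the store is just embedded in the same process.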
So, you guys forked gemfire, spark and spark-jobserver. Are you planning on going back into mainline or will you be maintaining the forks to stay current at some interval?
Just to be clear, we do support SnappyData as a library (unfortunately, the docs don't make this clear) that can work with your distribution (up to Spark 1.6 today). Our releases will support the latest Spark version in a staggered manner.
Everything we have done is an extension to Spark. None of the existing functionality is lost. In a sense, we have turned Spark to work more like a database. So, don't think we can turn it back into the "mainline" (assuming that is what you meant).
SnappyData: https://www.snappydata.io, MemSQL: https://www.memsql.com/, Splice Machine: https://www.splicemachine.com/, SAP HANA: https://www.sap.com/products/hana.html, GridGain: https://www.gridgain.com/ are some of the technologies in this space.