Does anyone understand how the Apache foundation works? Do projects receive monetary funding, or is it just the “prestige” of becoming an Apache project? What’s the advantage of being under their umbrella?
At this point, any legitimacy of working with their foundation may be lost under the weight of hundreds or even thousands of projects of unknown quality levels (I’m not talking about this project’s merits, which I know nothing about).
20+ year Apache Member here... yeah, it is pretty much prestige. But you get community, infrastructure, legal, branding, as well as mentoring on how to do open source.
It is all pretty well documented. Here are a couple good links to get you started...
The Apache Foundation provides a well-trodden legal path for large corporations to release their internal code as open source.
They do have some competition. The Linux Foundation is another large non-profit that creates umbrella entities for a bunch of open-source software originally created within larger tech companies. I get the impression that the Apache Foundation goes for breadth, taking any and all donations, while the Linux Foundation goes for depth in specific topics.
In terms of funding, for open-source projects originally created within a larger company, that company will often provide a financial donation to the foundation that is taking on its ongoing management. The foundation will also take a cut of future donations to the project, to pay for the administrative overhead of the non-profit.
One thing that totally surprised me is that they claim to support Spark, but only if the Spark application runs in the Pinot cluster. There is no Pinot Spark writer format; you need to drastically change the way you're deploying Spark, for example if you're using Databricks or K8s as a master.
This does not run inside the Pinot cluster; you can use the standard Spark execution engine to run this ingestion. In addition, Pinot supports out-of-the-box ingestion from batch sources using the Minion framework (https://docs.pinot.apache.org/basics/components/cluster/mini...), which does not need any external component (like Spark).
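To make the Minion point concrete, here's roughly what a Minion-driven batch ingestion task looks like in the table config. This is a minimal sketch shown as a Python dict; the field names follow what's documented for SegmentGenerationAndPushTask, but verify them against your Pinot version, and the S3 path is made up:

```python
# Sketch of a Minion-driven batch ingestion task declared in a Pinot table
# config (rendered as a Python dict). Field names follow the documented
# SegmentGenerationAndPushTask; the bucket path is hypothetical.
table_task_config = {
    "task": {
        "taskTypeConfigsMap": {
            "SegmentGenerationAndPushTask": {
                # Quartz cron: have Minions pick up new input files every 10 minutes
                "schedule": "0 */10 * * * ?",
                "inputDirURI": "s3://example-bucket/events/",   # made-up path
                "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
                "inputFormat": "parquet",
            }
        }
    }
}
```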
This is actually very different from most other publicly available Spark integrations.
Remember that while the setups in the docs you linked were normal 10 years ago (I'm thinking of old and expensive Cloudera clusters), right now most people can use Spark without ever needing to touch the spark-submit CLI (e.g. Databricks).
The proposed approach forces the user into separate Spark applications. What if you want to take the result of a DataFrame and save it directly to Pinot? Most other storage services support this (JDBC, S3, Redshift, ...), as sketched below.
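Concretely, this is the pattern being asked for. A minimal sketch in PySpark; the JDBC writer below is Spark's real built-in one, while the "pinot" format is purely hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writer-example").getOrCreate()
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# The DataFrame-writer pattern that JDBC, Redshift, etc. support today
# (needs the Postgres JDBC driver on the classpath to actually run):
(df.write.format("jdbc")
    .option("url", "jdbc:postgresql://db:5432/analytics")
    .option("dbtable", "events")
    .mode("append")
    .save())

# What the comment is asking Pinot for (HYPOTHETICAL format name; no such
# connector is implied to exist):
# df.write.format("pinot").option("table", "events").mode("append").save()
```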
This is not just a matter of taste; this approach just wouldn't work with the evolving Spark ecosystem (Spark Connect, or even just PySpark).
There's a lot more good tech than we had 10 years ago, and it should be easy (and fun) to roll high-level stacks with top performance these days. Something that feels lightweight like a library but scales out linearly.
Here's a better side-by-side comparison of Apache Pinot vs. Apache Druid vs. ClickHouse made back in April 2023. Let me know if anyone sees any updates or corrections that should be made. Would be more than happy to update.
Potentially dumb question, but why does “who viewed my profile” (the feature mentioned in the post as the original use case at LinkedIn) require a realtime OLAP datastore anyway?
That's a great question. A bit of history. Here's more on Pinot, back when it was invented at LinkedIn, before it was ASF-incubated as "Apache Pinot."
LinkedIn originally had been using traditional OLTP systems to power the "Who Viewed My Profile" app; they simply hit a wall. So LinkedIn looked at other analytics databases available at the time. Most just couldn't handle the large number of QPS that, as a social media platform, they readily anticipated. This is why they defined the category as "user-facing, real-time analytics."
[This is terminology mostly specific to the Apache Pinot crowd, though I see that StarRocks / CelerData also recently started talking about user-facing analytics. So I wrote up an article explaining it here:
Other similar extant systems LinkedIn benchmarked at the time just couldn't give them the numbers they needed at the low latency, high concurrency and large scale they anticipated — terabytes to petabytes of data. So they wrote their own solution.
Pinot was originally intended for marketing purposes to capture live intent & action data.
The same Pinot infrastructure eventually grew to other real-time use cases. "Who Viewed My Profile" was followed by "Company Follow Analytics," then sales or recruiting, and even internal A/B testing.
[I wasn't at LinkedIn while this happened; this lore was passed down to me by others. Disclosure: yes, I work at StarTree.]
LinkedIn is just another social network, more work-related. When you post something new, you want to see how many likes/comments it gets, just like on Instagram. "Who viewed my profile" is real on LinkedIn. It can be someone who is hiring, someone who may watch your talk, someone who may buy your product. I personally started some business conversations when I realized someone had viewed my profile, or at least added them as new connections.
Got it. To that end, Apache Pinot has a special index, called the star-tree index, that allows certain dimensions to be drilled down on further, more granularly, than others. It's part of what makes Pinot so fast.
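If you're curious, the star-tree index is declared per table in the table config. A minimal sketch shown as a Python dict; the field names match the documented star-tree config, but the column names are made up:

```python
# Sketch of a star-tree index declaration inside a Pinot table config.
# Column names are hypothetical; field names follow the documented
# star-tree config (verify against your Pinot version).
star_tree_table_config = {
    "tableIndexConfig": {
        "starTreeIndexConfigs": [
            {
                # Dimensions to pre-aggregate on, in drill-down order
                "dimensionsSplitOrder": ["country", "browser", "deviceType"],
                "skipStarNodeCreationForDimensions": [],
                # Aggregations pre-computed for those dimension combinations
                "functionColumnPairs": ["COUNT__*", "SUM__impressions"],
                # Stop splitting once a node holds at most this many records
                "maxLeafRecords": 10000,
            }
        ]
    }
}
```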
JOINs are a non-trivial problem with data in the terabyte-to-petabyte range. Apache Pinot had to re-architect itself with a multi-stage query engine in order to handle native query-time JOINs. There was a separate blog released today that dealt with that specifically:
[Edit: if users just naively throw query-time JOINs at a problem, they might not get results in the time they want; a non-optimized real-time JOIN took upwards of 39 seconds. With predicate pushdowns and partition-aware JOINs, results came back in 3.3 seconds, faster by an order of magnitude. Still kind of long for a typical page or mobile app refresh, but survivable. And yes, even faster subsecond results were possible with additional compute parallelism, but that has a cloud services cost associated with it.
Pre-ingestion JOINs, such as through Apache Flink, will generally be more performant than query-time JOINs, and that's how most people do them right now anyway. So just because you can do query-time JOINs doesn't mean you should if you haven't thought about it ahead of time. If you can, optimize the data partitioning to ensure the best performance (see the sketch below). The good news is that users now have a lot more flexibility in how they want to do JOINs.]
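To make the partition-aware point concrete, here's a hedged sketch of a query-time JOIN issued through the pinotdb Python client. Table and column names are invented, and exactly how the multi-stage engine is enabled (the SET option below) varies by Pinot version:

```python
# Hedged sketch: a query-time JOIN against Pinot's multi-stage engine via
# the pinotdb client. Tables/columns are invented; the SET option name is
# an assumption to verify against your Pinot version.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()

# If both tables are partitioned on the join key, Pinot can colocate the
# JOIN per partition instead of shuffling rows between servers.
cur.execute("""
    SET useMultistageEngine = true;
    SELECT v.viewer_id, COUNT(*) AS views
    FROM profile_views v
    JOIN profiles p ON v.viewer_id = p.user_id
    WHERE p.industry = 'Recruiting'
    GROUP BY v.viewer_id
    ORDER BY views DESC
    LIMIT 10
""")
print(cur.fetchall())
```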
Somehow I have a different experience: anything beyond a simple "select a few fields from t where k = ?" usually causes all kinds of errors and stability issues on a good-sized dataset.
You can be both right. I would say that ClickHouse is a focused system - it is really really good at the things it's good at, and it is not good at the rest.
Anecdote: I tested ClickHouse as possible replacement of Trino+DeltaLake+S3 for some of our use cases.
When querying precomputed flat tables it was easily 10x to 100x faster. When running complex ETL to prepare those tables, I gave up: a CTAS with six CTEs that takes Trino 30 seconds to compute on the fly turned into six intermediate tables that took 40 minutes to compute, and I wasn't even halfway done.
The tricky part is, how do you know whether your use case fits? But you have to ask this question about all the "specialized tools", including Pinot.
I’ve used Snowflake and ClickHouse, at least, in near-real-time settings. I think you’re right that Pinot (and Druid/ClickHouse) are more suitable for this than the others, though.
Redshift, Snowflake, Citus, Greenplum, and Athena are OLAP engines, but not real-time focused. This one is more similar to Druid, ClickHouse, or Rockset.
The 1.0 version of Pinot seems to bring a lot of maturity; they seem to have added a new engine that can do JOINs now. I'm not sure how stable it is, but it seems interesting.
As for what this kind of database is useful for: it's for operational analytics on large data that also updates in real time. In my domain that would be things like having insight into large supply chains or manufacturing operations, like power plants or factories; in general, for monitoring stuff. I know it's also used in security and finance (for fraud).
Not really, no. Athena is most likely just Presto under the hood.
This is a completely different beast in terms of expected query latency (sub-second), and as a result it has different constraints on the structure of the data, the amount of data queried, and, naturally, cost.
You can think of Pinot and Druid as solving for the real-time analytics cases, like say for instance you run an ad network and you want to provide your customers with a live dashboard of ad impressions and allow segmentation of that dashboard by cohorts and other metrics on the fly.
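Concretely, that dashboard boils down to firing low-latency aggregations like this on every filter change. A hedged sketch using the pinotdb client; the schema is invented:

```python
# Hedged sketch of the dashboard query described above, via the pinotdb
# client. The schema (ad_events, cohort, ...) is hypothetical; ago() takes
# an ISO-8601 duration in Pinot, but verify against your version.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("""
    SELECT campaign_id, cohort, COUNT(*) AS impressions
    FROM ad_events
    WHERE customer_id = 42
      AND event_time_millis > ago('PT15M')
    GROUP BY campaign_id, cohort
    ORDER BY impressions DESC
    LIMIT 20
""")
print(cur.fetchall())
```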
Based on that, it's an MPP columnar database focused on low-latency, streaming-ingested/real-time-ish use cases, open-sourced by LinkedIn's infra teams:
"Pinot is designed to deliver low latency queries on large datasets. To achieve this performance, Pinot stores data in a columnar format and adds additional indices to perform fast filtering, aggregation and group by.
Raw data is broken into small data shards. Each shard is converted into a unit called a segment. One or more segments together form a table, which is the logical container for querying Pinot using SQL/PQL.
...
Logically, a cluster is simply a group of tenants. As with the classical definition of a cluster, it is also a grouping of a set of compute nodes. Typically, there is only one cluster per environment/data center. There is no need to create multiple clusters, since Pinot supports the concept of tenants. At LinkedIn, the largest Pinot cluster consists of 1000+ nodes distributed across a data center. Nodes can be added to a cluster in a way that linearly increases the performance and availability of queries."
Q: When do new events become queryable when ingested into a real-time table?
A: Events are available to queries as soon as they are ingested. This is because events are instantly indexed in memory upon ingestion.
The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.
However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all of its replicas are guaranteed to be consistent via the commit protocol.
...
Q: Why are segments not strictly time-partitioned?
A: It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time intervals for the performance of hybrid tables and time-filtered data.
When generating offline segments, the segments are generated such that each segment contains only one time interval and is well partitioned by the time column.
Looks pretty neat. The question that is probably most critical to answer is what the latency difference is when I do JOINs vs. when I run queries on denormalized data. Can it support a similar level of concurrency when using JOINs?