Does anyone understand how the Apache foundation works? Do projects receive monetary funding, or is it just the “prestige” of becoming an Apache project? What’s the advantage of being under their umbrella?
At this point, any legitimacy of working with their foundation may be lost under the weight of hundreds or even thousands of projects of unknown quality levels (I’m not talking about this project’s merits, which I know nothing about).
20+ year Apache Member here... yeah, it is pretty much prestige. But you get community, infrastructure, legal, branding, as well as mentoring on how to do open source.
It is all pretty well documented. Here are a couple good links to get you started...
The Apache Foundation provides a well-trodden legal path for large corporations to release their internal code as open source.
They do have some competition. The Linux Foundation is another large non-profit that creates umbrella entities for a bunch of open-source software originally created within larger tech companies. I get the impression that the Apache Foundation goes for breadth, taking any and all donations, while the Linux Foundation goes for depth in specific topics.
In terms of funding, for open-source projects originally created within a larger company, that company will often provide a financial donation to the foundation that is taking on its ongoing management. The foundation will also take a cut of future donations to the project, to pay for the administrative overhead of the non-profit.
One thing that totally surprised me is that they claim to support Spark, but only if the Spark application runs in the Pinot cluster. There is no Pinot Spark writer format; you need to drastically change the way you're deploying Spark, for example if you're using Databricks or K8s as a master.
This does not run inside the Pinot cluster; you can use the standard Spark execution engine to run this ingestion. In addition, Pinot supports out-of-the-box ingestion from batch sources using the Minion framework (https://docs.pinot.apache.org/basics/components/cluster/mini...), which does not need any external component (like Spark).
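To make the Minion point concrete, here's roughly what a Minion-driven batch ingestion task looks like in the table config. This is a minimal sketch shown as a Python dict; the field names follow what's documented for SegmentGenerationAndPushTask, but verify them against your Pinot version, and the S3 path is made up:

```python
# Sketch of a Minion-driven batch ingestion task declared in a Pinot table
# config (rendered as a Python dict). Field names follow the documented
# SegmentGenerationAndPushTask; the bucket path is hypothetical.
table_task_config = {
    "task": {
        "taskTypeConfigsMap": {
            "SegmentGenerationAndPushTask": {
                # Quartz cron: have Minions pick up new input files every 10 minutes
                "schedule": "0 */10 * * * ?",
                "inputDirURI": "s3://example-bucket/events/",   # made-up path
                "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
                "inputFormat": "parquet",
            }
        }
    }
}
```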
This is actually very different from most other publicly available Spark integrations.
Remember that while the setups in the docs you linked were normal 10 years ago (I'm thinking of old and expensive Cloudera clusters), right now most people can use Spark without ever needing to touch the spark-submit CLI (e.g. Databricks).
The proposed approach forces the user into separate Spark applications. What if you want to take the result of a DataFrame and save it directly to Pinot? Most other storage services support this (JDBC, S3, Redshift, ...), as sketched below.
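Concretely, this is the pattern being asked for. A minimal sketch in PySpark; the JDBC writer below is Spark's real built-in one, while the "pinot" format is purely hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writer-example").getOrCreate()
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# The DataFrame-writer pattern that JDBC, Redshift, etc. support today
# (needs the Postgres JDBC driver on the classpath to actually run):
(df.write.format("jdbc")
    .option("url", "jdbc:postgresql://db:5432/analytics")
    .option("dbtable", "events")
    .mode("append")
    .save())

# What the comment is asking Pinot for (HYPOTHETICAL format name; no such
# connector is implied to exist):
# df.write.format("pinot").option("table", "events").mode("append").save()
```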
This is not just a matter of taste; this approach just wouldn't work with the evolving Spark ecosystem (Spark Connect, or even just PySpark).
There's a lot more good tech than we had 10 years ago, and it should be easy (and fun) to roll high-level stacks with top performance these days. Something that feels lightweight like a library but scales out linearly.
Here's a better side-by-side comparison of Apache Pinot vs. Apache Druid vs. ClickHouse made back in April 2023. Let me know if anyone sees any updates or corrections that should be made. Would be more than happy to update.
Potentially dumb question, but why does “who viewed my profile” (the feature mentioned in the post as the original use case at LinkedIn) require a realtime OLAP datastore anyway?
That's a great question. A bit of history. Here's more on Pinot, back when it was invented at LinkedIn, before it was ASF-incubated as "Apache Pinot."
LinkedIn originally had been using traditional OLTP systems to power the "Who Viewed My Profile" app; they simply hit a wall. So LinkedIn looked at other analytics databases available at the time. Most just couldn't handle the large number of QPS that, as a social media platform, they readily anticipated. This is why they defined the category as "user-facing, real-time analytics."
[This is terminology mostly specific to the Apache Pinot crowd, though I see that StarRocks / CelerData also recently started talking about user-facing analytics. So I wrote up an article explaining it here:
Other similar extant systems LinkedIn benchmarked at the time just couldn't give them the numbers they needed at the low latency, high concurrency and large scale they anticipated — terabytes to petabytes of data. So they wrote their own solution.
Pinot was originally intended for marketing purposes to capture live intent & action data.
The same Pinot infrastructure eventually grew to other real-time use cases. "Who Viewed My Profile" was followed by "Company Follow Analytics," then sales or recruiting, and even internal A/B testing.
[I wasn't at LinkedIn while this happened; this lore was passed down to me by others. Disclosure: yes, I work at StarTree.]
LinkedIn is just another social network, more work-related. When you post something new, you want to see how many likes/comments it gets, just like on Instagram. "Who viewed my profile" is real on LinkedIn. It can be someone who is hiring, someone who may watch your talk, someone who may buy your product. I personally started some business conversations when I realized someone had viewed my profile, or at least added them as new connections.
Got it. To that end, Apache Pinot has a special index, called the star-tree index, that allows certain dimensions to be drilled down on further, more granularly, than others. It's part of what makes Pinot so fast.
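If you're curious, the star-tree index is declared per table in the table config. A minimal sketch shown as a Python dict; the field names match the documented star-tree config, but the column names are made up:

```python
# Sketch of a star-tree index declaration inside a Pinot table config.
# Column names are hypothetical; field names follow the documented
# star-tree config (verify against your Pinot version).
star_tree_table_config = {
    "tableIndexConfig": {
        "starTreeIndexConfigs": [
            {
                # Dimensions to pre-aggregate on, in drill-down order
                "dimensionsSplitOrder": ["country", "browser", "deviceType"],
                "skipStarNodeCreationForDimensions": [],
                # Aggregations pre-computed for those dimension combinations
                "functionColumnPairs": ["COUNT__*", "SUM__impressions"],
                # Stop splitting once a node holds at most this many records
                "maxLeafRecords": 10000,
            }
        ]
    }
}
```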
JOINs are a non-trivial problem with data in the terabyte-to-petabyte range. Apache Pinot had to re-architect itself with a multi-stage query engine in order to handle native query-time JOINs. There was a separate blog released today that dealt with that specifically:
[Edit: if users just naively throw query-time JOINs at a problem, they might not get results in the time they want; a non-optimized real-time JOIN took upwards of 39 seconds. With predicate pushdowns and partition-aware JOINs, results came back in 3.3 seconds, faster by an order of magnitude. Still kind of long for a typical page or mobile app refresh, but survivable. And yes, even faster subsecond results were possible with additional compute parallelism, but that has a cloud services cost associated with it.
Pre-ingestion JOINs, such as through Apache Flink, will generally be more performant than query-time JOINs, and that's how most people do them right now anyway. So just because you can do query-time JOINs doesn't mean you should if you haven't thought about it ahead of time. If you can, optimize the data partitioning to ensure the best performance (see the sketch below). The good news is that users now have a lot more flexibility in how they want to do JOINs.]
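To make the partition-aware point concrete, here's a hedged sketch of a query-time JOIN issued through the pinotdb Python client. Table and column names are invented, and exactly how the multi-stage engine is enabled (the SET option below) varies by Pinot version:

```python
# Hedged sketch: a query-time JOIN against Pinot's multi-stage engine via
# the pinotdb client. Tables/columns are invented; the SET option name is
# an assumption to verify against your Pinot version.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()

# If both tables are partitioned on the join key, Pinot can colocate the
# JOIN per partition instead of shuffling rows between servers.
cur.execute("""
    SET useMultistageEngine = true;
    SELECT v.viewer_id, COUNT(*) AS views
    FROM profile_views v
    JOIN profiles p ON v.viewer_id = p.user_id
    WHERE p.industry = 'Recruiting'
    GROUP BY v.viewer_id
    ORDER BY views DESC
    LIMIT 10
""")
print(cur.fetchall())
```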
Somehow I have a different experience: anything beyond a simple "select a few fields from t where k = ?" usually causes all kinds of errors and stability issues on a good-sized dataset.
You can be both right. I would say that ClickHouse is a focused system - it is really really good at the things it's good at, and it is not good at the rest.
Anecdote: I tested ClickHouse as possible replacement of Trino+DeltaLake+S3 for some of our use cases.
When querying precomputed flat tables it was easily 10x to 100x faster. When running complex ETL to prepare those tables, I gave up: a CTAS with six CTEs that takes Trino 30 seconds to compute on the fly turned into six intermediate tables that took 40 minutes to compute, and I wasn't even halfway done.
The tricky part is, how do you know whether your use case fits? But you have to ask this question about all the "specialized tools", including Pinot.
I’ve used Snowflake and ClickHouse, at least, in near-real-time settings. I think you’re right that Pinot (and Druid/ClickHouse) are more suitable for this than the others, though.
Redshift, Snowflake, Citus, Greenplum, and Athena are OLAP engines, but not real-time focused. This one is more similar to Druid, ClickHouse, or Rockset.
The 1.0 version of Pinot seems to bring a lot of maturity; they seem to have added a new engine that can do JOINs now. I'm not sure how stable it is, but it seems interesting.
As for what this kind of database is useful for: it's for operational analytics on large data that also updates in real time. In my domain that would be things like having insight into large supply chains or manufacturing operations, like power plants or factories; in general, for monitoring stuff. I know it's also used in security and finance (for fraud).
Not really, no. Athena is most likely just Presto under the hood.
This is a completely different beast in terms of expected query latency (sub-second), and as a result it has different constraints on the structure of the data, the amount of data queried, and, naturally, cost.
You can think of Pinot and Druid as solving for the real-time analytics cases, like say for instance you run an ad network and you want to provide your customers with a live dashboard of ad impressions and allow segmentation of that dashboard by cohorts and other metrics on the fly.
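Concretely, that dashboard boils down to firing low-latency aggregations like this on every filter change. A hedged sketch using the pinotdb client; the schema is invented:

```python
# Hedged sketch of the dashboard query described above, via the pinotdb
# client. The schema (ad_events, cohort, ...) is hypothetical; ago() takes
# an ISO-8601 duration in Pinot, but verify against your version.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("""
    SELECT campaign_id, cohort, COUNT(*) AS impressions
    FROM ad_events
    WHERE customer_id = 42
      AND event_time_millis > ago('PT15M')
    GROUP BY campaign_id, cohort
    ORDER BY impressions DESC
    LIMIT 20
""")
print(cur.fetchall())
```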
Based on that, it's an MPP columnar database focused on low-latency, streaming-ingested/real-time-ish use cases, open-sourced by LinkedIn's infra teams:
"Pinot is designed to deliver low latency queries on large datasets. To achieve this performance, Pinot stores data in a columnar format and adds additional indices to perform fast filtering, aggregation and group by.
Raw data is broken into small data shards. Each shard is converted into a unit called a segment. One or more segments together form a table, which is the logical container for querying Pinot using SQL/PQL.
...
Logically, a cluster is simply a group of tenants. As with the classical definition of a cluster, it is also a grouping of a set of compute nodes. Typically, there is only one cluster per environment/data center. There is no need to create multiple clusters, since Pinot supports the concept of tenants. At LinkedIn, the largest Pinot cluster consists of 1000+ nodes distributed across a data center. Nodes can be added to a cluster in a way that linearly increases the performance and availability of queries."
Q: When do new events become queryable when ingested into a real-time table?
A: Events are available to queries as soon as they are ingested. This is because events are instantly indexed in memory upon ingestion.
The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.
However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all of its replicas are guaranteed to be consistent via the commit protocol.
...
Q: Why are segments not strictly time-partitioned?
A: It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time intervals for the performance of hybrid tables and time-filtered data.
When generating offline segments, the segments are generated such that each segment contains only one time interval and is well partitioned by the time column.
Looks pretty neat. The question that is probably most critical to answer is what the latency difference is when I do JOINs vs. when I run queries on denormalized data. Can it support a similar level of concurrency when using JOINs?