Hi. I'm the presenter. Thanks for the interest. Opinions here are my own.
I'll put in a TLDR as the presentation is quite long. The other thing I'd like to say is that QCon London impressed me: the organisers spent real time ensuring the quality of the presentations, and the other talks I saw were great. Many conferences I've been to recently are just happy to get someone, or play it safe with known quantities. I first attended QCon London early in my career, so it was interesting coming back after more than a decade to present.
TLDR:
Why did we build our own database? In effort terms, successful quantitative trading is more about good ideas well executed than it is about production trading technology (apart from perhaps HFT). We needed something that helped the quants be as productive as possible with data.
We needed something that was:
- Easy to use (I mean really easy, for beginner-to-moderate programmers). We talk about day-1 productivity for new starters. Python is a tool for quants, not a career.
- Cost effective to run (no large DB infra, easy to maintain, cheap storage, low licensing)
- Performant (traditional SQL DBs don't compare here; we're in the Parquet, ClickHouse, kdb+, etc. space)
- Scalable (large data-science jobs 10K+ cores, on-demand)
This sort of general architecture (store parquet-like files somewhere like s3 and build a metadata database on top) seems reasonably common and gives obvious advantages for storing lots of data, scaling horizontally, and scaling storage and compute separately. I wonder where you feel your advantages are compared to similar systems? Eg is it certain API choices/affordances like the ‘time travel’ feature, or having in-house expertise or some combination of features that don’t usually come together?
A slightly more technical question is what your time series indexes are? Is it about optimising storage, or doing fast random-access lookups, or more for better as-of joins?
We do have a specialist time-series index, optimised for things like tick data. It compresses fairly well, but we generally optimise for read time: not scattered random access, but slicing out date ranges. There are two layers of index: a high-level index over the data objects, and an index inside each object in S3.
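To illustrate the idea (this is a toy sketch with made-up names, not ArcticDB internals): a coarse first layer maps each storage object to the time range it covers, so a date-range query only needs to open a few objects, and a second layer inside each object narrows down to the matching rows.

```python
import bisect

# Hypothetical two-layer index sketch (not the ArcticDB implementation).
# Layer 1: coarse index of (start_ts, end_ts) per storage object.
# Layer 2: sorted row timestamps inside each object.
objects = [
    (0, 9, [0, 3, 5, 9]),
    (10, 19, [10, 12, 18]),
    (20, 29, [20, 25, 29]),
]
starts = [o[0] for o in objects]

def read_range(lo, hi):
    """Return all row timestamps in [lo, hi], touching as few objects as possible."""
    # Layer 1: binary-search the coarse index for the first candidate object.
    first = max(0, bisect.bisect_right(starts, lo) - 1)
    out = []
    for start, end, rows in objects[first:]:
        if start > hi:
            break      # objects are sorted by start; nothing later can match
        if end < lo:
            continue   # object lies entirely before the requested range
        # Layer 2: binary-search rows inside the object.
        i = bisect.bisect_left(rows, lo)
        j = bisect.bisect_right(rows, hi)
        out.extend(rows[i:j])
    return out

print(read_range(5, 20))  # -> [5, 9, 10, 12, 18, 20]
```

The point of the two layers is that the coarse index can live in the metadata database while the fine index ships with each object, so a date-range read never scans objects outside the requested window.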
A built-in as-of join is something we want to build.
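For context, an as-of join matches each row of one series with the most recent row at or before it in another (e.g. each trade with the prevailing quote). A minimal pure-Python sketch over sorted timestamps, not any ArcticDB API:

```python
import bisect

def asof_join(left_ts, right_ts, right_vals):
    """For each left timestamp, take the latest right value at or before it.

    Both timestamp lists are assumed sorted ascending.
    """
    out = []
    for t in left_ts:
        i = bisect.bisect_right(right_ts, t) - 1
        out.append(right_vals[i] if i >= 0 else None)
    return out

# Example: join trade times against the latest quote at or before each trade.
quotes_ts = [1, 4, 7]
quotes_px = [100.0, 101.5, 99.8]
trades_ts = [0, 4, 5, 9]
print(asof_join(trades_ts, quotes_ts, quotes_px))  # -> [None, 101.5, 101.5, 99.8]
```

This is the same operation pandas exposes as `merge_asof`; a database-native version can push the binary search down into the time-series index instead of materialising both sides.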
I feel like ‘exactly’ is doing a lot of work in your comment and I am interested in the reasons that that word may not be quite the right word to describe these situations.
A much shorter 3 min intro from PyQuantNews: https://www.youtube.com/watch?v=5_AjD7aVEEM
GitHub repo (Source-available/BSL): https://github.com/man-group/ArcticDB