Hacker News | new | past | comments | ask | show | jobs | submit | acidity's comments | login

This is not a fly.io-specific question, but I always wonder: in these globally distributed systems, how are databases handled?

I understand you can put your application server in any location, but generally there is only one storage layer, so are these application servers doing cross-region database calls?

Having only worked with single-cluster web apps, I am always curious about this part.

Is the answer always: use a managed replicated database, send read queries to a replica near your location, and route all write queries to the primary instance?


That's our answer, yes. Read replicas and we redirect requests that need writes to a "primary" region: https://fly.io/blog/globally-distributed-postgres/

We tried cross-region database calls for HTTP requests. They were bad. They seem to work OK for long-lived connections like websockets, though.
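A toy sketch of that read-replica split (the region names and connection strings here are made up for illustration, not fly.io's actual setup):

```python
# Route reads to the nearest replica and writes to the primary region.
# Hypothetical regions/hosts; real setups also need replication-lag handling.

PRIMARY_REGION = "iad"
REPLICAS = {
    "iad": "db.iad.example",
    "lhr": "db.lhr.example",
    "syd": "db.syd.example",
}

def pick_database(sql: str, current_region: str) -> str:
    """Send writes to the primary; serve reads from the local replica."""
    first_word = sql.lstrip().split(None, 1)[0].upper()
    if first_word in {"INSERT", "UPDATE", "DELETE"}:
        return REPLICAS[PRIMARY_REGION]
    # Fall back to the primary if there is no replica in this region.
    return REPLICAS.get(current_region, REPLICAS[PRIMARY_REGION])

print(pick_database("SELECT * FROM users", "lhr"))         # db.lhr.example
print(pick_database("UPDATE users SET plan = 'x'", "syd")) # db.iad.example
```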


Using something like CockroachDB, you should be able to avoid the single-writer situation. It has its own complexities, I'm sure, but it seems very compelling.


You would need a commercial licence for this to work, though.


Thanks. My initial question arose from wondering where products like Snowflake/Redshift fall into this. They seem to be pitched as data warehouses, but they appear to do the same work as OLAP. But looking at the Snowflake docs I read:

"""Currently there is no option similar to an SSAS cube in Snowflake. Once data is loaded into the databases Snowflake allows to query the data in a way similar to traditional OLTP databases. For data aggregations the Snowflake library offers rich sets of in-built functions. Together with UDFs, SPs and Materialized Views we can build custom solutions for precomputed data aggregations. For data analysis we still have to rely upon third party tools. Snowflake provides a variety of different connectors to access its database objects from other analytical tools. There are plans in near future to introduce an integrated tool for data aggregations, analysis and reporting."""

So they say it's not really OLAP.

And how exactly does it differ from OLTP? Is a web app OLTP?


The difference between OLAP and OLTP, which I will interpret a bit from the original "OLAP Manifesto", is that OLAP is about providing access to data, leveraging as much resource as necessary to produce analytical results in the query. OLTP was about transaction processing, and the emphasis was on reducing the number of elements that had to be locked and changed during transaction processing. Normalizing data reduces moving parts. Long chains of aggregations and calculations consume lots of resources. OLAP was introduced to contrast the performance of specialized machinery with what RDBMSs were typically good at, circa 1993, which was imperfect since DSS databases had been around for a long time by then.

A typical dimensional star schema is good for giving you various sums, counts, etc. Unlike classic specialized multidimensional analysis and planning tools, you can derive an arbitrary set of subselections and additional calculations within a query, subject to the limits of the SQL version. For example, concomitance and basket analysis of what products sell with other products gives you a result that has two product keys, product and 'along-with-product'. In an OLAP cube with just a single product dimension, you can't derive this at all. So, given a data set, the RDBMS gives you a greater ability to filter and aggregate. It just comes at the cost of generating the right SQL.
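The along-with-product query can be sketched as a self-join; this uses SQLite and invented table/column names purely for illustration:

```python
import sqlite3

# Toy basket data: each row is one product in one sales basket.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (basket_id INTEGER, product TEXT);
    INSERT INTO sales VALUES
        (1, 'bread'), (1, 'butter'),
        (2, 'bread'), (2, 'jam');
""")

# Self-join on basket_id yields the two product keys the comment
# describes: product and 'along-with-product'.
rows = conn.execute("""
    SELECT a.product, b.product AS along_with_product, COUNT(*) AS baskets
    FROM sales a
    JOIN sales b ON b.basket_id = a.basket_id AND b.product <> a.product
    GROUP BY a.product, b.product
    ORDER BY a.product, b.product
""").fetchall()
print(rows)
# [('bread', 'butter', 1), ('bread', 'jam', 1),
#  ('butter', 'bread', 1), ('jam', 'bread', 1)]
```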

For the specialized (sometimes called multidimensional, but they're all multidimensional) tools like SSAS, you trade away overall flexibility of querying when you create a cube, since you predefine all the major relationships, but you gain expressiveness in how you model your calculations, and you make consuming the data easier. An RDBMS provides no metadata that supports user traversal of the dimensional space, and there are lots of useful requests that cannot be modeled as a relational query without a huge amount of information injected into the query. As a very small but tractable example, suppose you want to see result rows sequenced as (budget, forecast, actual); this is not alphabetical, so the query needs to define some intermediate table of labels and sort keys like (budget, 1), (forecast, 2), (actual, 3). Specialized systems make it much easier to create a precise result form, especially when the data would otherwise map to multiple fact tables, and they also make long chains of calculations much easier to specify. A simple set of metadata definitions and a small amount of query text can specify what would take many KB of SQL.
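The (budget, forecast, actual) sequencing trick can be sketched like this, using SQLite and hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (scenario TEXT, amount INTEGER);
    INSERT INTO results VALUES
        ('actual', 90), ('budget', 100), ('forecast', 95);

    -- The intermediate label/sort-key table the comment describes:
    -- the desired (budget, forecast, actual) order is not alphabetical.
    CREATE TABLE scenario_order (scenario TEXT, sort_key INTEGER);
    INSERT INTO scenario_order VALUES
        ('budget', 1), ('forecast', 2), ('actual', 3);
""")

rows = conn.execute("""
    SELECT r.scenario, r.amount
    FROM results r
    JOIN scenario_order o ON o.scenario = r.scenario
    ORDER BY o.sort_key
""").fetchall()
print(rows)  # [('budget', 100), ('forecast', 95), ('actual', 90)]
```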

The code required to provide a user interface to the results is also probably higher for relational. A relational query only has 1 cursor. A query in MDX or similar language has N cursors, 1 for each axis of result, plus the matrix data set formed by crossing the axes. This allows much easier specification of the result that will be rendered for the user to consume. Shredding a single SQL result cursor into a crosstab requires plenty of code, and issuing a batch of queries (1 for each crosstab edge, 1 for the matrix) requires less code but still enough.
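Shredding a single flat cursor into a crosstab might look roughly like this (toy data, not any particular API):

```python
# One flat SQL-style cursor: (row_label, col_label, value) tuples.
rows = [
    ("widgets", "Q1", 10), ("widgets", "Q2", 12),
    ("gadgets", "Q1", 7),  ("gadgets", "Q2", 9),
]

# Rebuild the two axes and the crossed matrix by hand; this is the
# extra work an MDX-style multi-cursor result would hand you directly.
row_axis, col_axis, cells = [], [], {}
for r, c, v in rows:
    if r not in row_axis:
        row_axis.append(r)
    if c not in col_axis:
        col_axis.append(c)
    cells[(r, c)] = v

# Render: one header line per column axis, one line per row.
print("\t" + "\t".join(col_axis))
for r in row_axis:
    print(r + "\t" + "\t".join(str(cells.get((r, c), "")) for c in col_axis))
```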

Hope this helps!


Can you expand on this? We are just starting on Helm so curious to know your thoughts.


Sure, the most asinine thing (borrowing from Rob Pike) is to have a system where invisible characters define the scope and semantics of what you are writing. Now Helm takes this one step further (and I went one step beyond that before saying no more to myself and discovering https://cuelang.org) and starts using text interpolation, with helpers for managing indentation, in this invisibly scoped language. I hacked in imports, but then thought: OK, I'm making this worse.

So there's this problem, and a number of experiments are going on. One camp has the idea of wrapping data/config in more code. These are your Pulumi- and Darklang-like systems. Then there is another camp that says you should wrap code in data and move away from programming, recursion, and Turing completeness. This seems like the right way to me, for a lot of reasons, both technical and human-centric.

I've pivoted my company (https://github.com/hofstadter-io/Hof) to be built around and powered by Cue. Of the logical camp, it is by far going to be the best, and it comes from a very successful lineage. I'm blown away by it, like when I found Go and k8s.


Recently migrated my company's k8s product to Cue and it's pure bliss.

Configuration should be data, not code. Cue has just the right amount of expressivity - anything more complex shouldn't be done at the configuration layer, but in the application or a separate operator.
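As a flavor of that "just enough expressivity" idea, here is a minimal Cue sketch (the field names are hypothetical): a definition constrains the data, and the configuration itself stays plain data.

```cue
// A definition (newer #-prefixed syntax) that constrains valid deployments.
#Deployment: {
	name:     string
	replicas: int & >=1 & <=10
	image:    string
}

// Concrete configuration: pure data, validated by unification with #Deployment.
web: #Deployment & {
	name:     "web"
	replicas: 3
	image:    "nginx:1.25"
}
```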


What's the k8s product?

Are you on the newer syntax? #StyleDefs


It's not yet public - it's a new server OS for k8s. Think of CoreOS, but without a traditional userland and fully stateless.

And yes, we're on the newer syntax :-)


Just wanted to clarify that Dark isn't like Pulumi. The idea with Dark is that there is no configuration, just code.


There is still configuration; there has to be. You've just wrapped it so much that it's not visible anymore (which is even worse than Pulumi; at least they are using an existing language). You still have to express (and write) the same information...

Darklang is solidly in the Pulumi camp; that's where outsiders put it. (I have seen the insides without a beta account or your demo; someone with a beta account showed me around a bit.)

The real problem with Darklang is that they have their own custom language and IDE. What exactly are you trying to solve?


As a side note, is there something similar for building a distributed application (could be a very simple NoSQL DB or maybe some stream processor)?


Newbie question but will this help me also write desktop apps?


Yep, you can build [UWP](https://docs.microsoft.com/en-us/windows/uwp/get-started/uni...) apps with it; not sure if there's an OS X desktop solution.


As an Indian who did an MS in the US, the parent is 100% correct.


He's not. Think back on your peer group and see if you can pick out many people who did an MS because there were an extra 20,000 slots. It was zero for me.


> Weird. OK, I tried it with some other parameters, and it suddenly took 15 seconds at 100% CPU, with PostgreSQL doing some linear scan through data. Using EXPLAIN I found that with full production data the query planner was doing something idiotic in some cases. I learned how to impact the query planner, and then this query went back to taking only a few milliseconds for any input. With this one change to influence the query planner (to actually always use an index I had properly made), things became dramatically faster. Basically the load on the database server went from 100% to well under 5%.

I am actually interested in this part. Figuring out issues with EXPLAIN is one of my favorite things.


I'm replying due to this same part/paragraph. I've been dealing with some manifestation of EXPLAIN since Oracle 6. In the last few years I've become handy with PostgreSQL's EXPLAIN as well, and this reminds me of my biggest hang-up regarding PostgreSQL: their hostility toward optimizer hints.

Like Mr. Stein, I too have found myself in bad places with PostgreSQL's optimizer. This is commonplace with relational systems; every such system I've ever dealt with, including all versions of Oracle since the mid-90s, Informix, MS-SQL, DB/2 (on AS/400, Windows, and Linux), and PostgreSQL, eventually gets handed a query and a schema that produce the wrong plan and intolerably bad performance. No exception. None of these attempts to create flawless optimizers that anticipate every use case has ever succeeded, PostgreSQL included.

With other systems there are hints that, as a last resort, you can apply to get efficient results. Not so much with PostgreSQL. Not implementing the sort of hints that solve these problems (as opposed to the often ineffectual enable_* planner configuration, unacceptable global configuration and other workarounds needed with PostgreSQL) is policy:

"We are not interested in implementing hints in the exact ways they are commonly implemented on other databases. Proposals based on 'because they've got them' will not be welcomed."

How about proposals based on "because your hint-free optimizer gets it wrong and I require a working solution without too many backflips and somersaults or database design lectures"? No? Then sorry; I can't risk getting painted into a corner by your narrow-minded and naive policy. PostgreSQL goes no further than non-critical, ancillary systems when I have a say in it. And I do.


You're absolutely correct. Optimizer hints are a source of contention in Postgres. I'm sure you're already aware, but just for completeness, here's some more information on optimizer hints and Postgres from Simon Riggs back in 2011. I don't expect this to sway opinion one way or the other. It's useful to read more on Postgres' policy.

http://blog.2ndquadrant.com/hinting_at_postgresql/

From the introduction:

- Introducing hints is a common source of later problems, because fixing a query plan once in a special case isn't a very robust approach. As your data set grows, and possibly changes distribution as well, the idea you hinted toward when it was small can become an increasingly bad idea.

- Adding a useful hint interface would complicate the optimizer code, which is difficult enough to maintain as it is. Part of the reason PostgreSQL works as well as it does running queries is because feel-good code (“we can check off hinting on our vendor comparison feature list!”) that doesn’t actually pay for itself, in terms of making the database better enough to justify its continued maintenance, is rejected by policy. If it doesn’t work, it won’t get added. And when evaluated objectively, hints are on average a problem rather than a solution.

- The sort of problems that hints work around can be optimizer bugs. The PostgreSQL community responds to true bugs in the optimizer faster than anyone else in the industry. Ask around and you don’t have to meet many PostgreSQL users before finding one who has reported a bug and watched it get fixed by the next day.

Now, the main completely valid response to finding out hints are missing, normally from DBAs who are used to them, is “well how do I handle an optimizer bug when I do run into it?” Like all tech work nowadays, there’s usually huge pressure to get the quickest possible fix when a bad query problem pops up....


Having never used postgres, this seems extremely scary. I've only ever needed index hints a couple times in other databases, but when you need them there isn't usually an alternative. Messing with global knobs is a good way to cause more problems than you are solving.

I've even entertained the idea that every query should be hinted. For OLTP workloads, you practically always know exactly how you want the DB to execute your query anyway. And oftentimes you find out very late that the query planner made the wrong choice and now your query is taking orders of magnitude longer than it should (worse, sometimes this changes at runtime). I've never actually gone through with this religiously, though...


>> Having never used postgres, this seems extremely scary. I've only ever needed index hints a couple times in other databases, but when you need them there isn't usually an alternative. Messing with global knobs is a good way to cause more problems than you are solving.

You've got the plot exactly. The last such battle I was involved with ended in creating a materialized view to substitute for several tables in a larger join; without the view there was no way[1] to get an acceptable plan. Creating this view was effectively just a form of programming our own planner. And yes, the need to update the view to get the desired result is an ongoing problem; one that's scheduled to get solved with a migration to another DB.

Like you I've never been all that quick to employ hints. I tend to use them while experimenting during development or troubleshooting and avoid them in production code. But there have been production uses, and you know what? The world did not end. No one laughed at or fired me. No regulatory agency fined me. It did not get posted on Daily WTF. No subsequent maintenance programmer has ever shown up at my home in the dead of night. It just solved the problem, quickly and effectively.

Sure would be nice if people purporting to offer fit-for-purpose relational systems understood the value of a little pragmatism.

[1] given the finite amount of time we could sacrifice to deal with it


Just to be clear, those settings are scoped per session, not global for the entire server. Still, that makes them almost useless for large queries, but I would not exactly call them scary.


I added code to my clients to do "SET enable_nestloop TO off" anytime they connect to the database. This sets a global flag for that session, which disables nestloop query planning. It could indeed impact other unrelated queries, which is deeply disturbing, but I don't know any way to disable nestloop query planning only for a specific query (aside from constantly setting and unsetting that flag?). Incidentally, here is an example of the query that causes all the trouble: "SELECT * FROM file_use WHERE project_id = any(select project_id from projects where users ? '25e2cae4-05c7-4c28-ae22-1e6d3d2e8bb3') ORDER BY last_edited DESC limit 100;" Search for nestloop in https://github.com/sagemathinc/smc/blob/master/src/smc-hub/p... to see my relevant client code.


> I don't know any way to disable nestloop query planning only for a specific query

You can do that by using SET LOCAL. Here's what your query would become:

  BEGIN;
  SET LOCAL enable_nestloop = off;
  <QUERY>
  COMMIT;
SET LOCAL applies the setting, but only within the current transaction.
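In application code, the same pattern can be wrapped in a small helper. This sketch only composes the statement sequence (enable_nestloop is a real PostgreSQL setting; the helper itself is hypothetical, and execution against a live connection is left out):

```python
def scoped_setting(setting: str, value: str, query: str) -> list:
    """Wrap a query so a planner setting applies only to its transaction.

    SET LOCAL reverts automatically at COMMIT/ROLLBACK, so other queries
    on the same connection are unaffected.
    """
    return [
        "BEGIN;",
        f"SET LOCAL {setting} = {value};",
        query,
        "COMMIT;",
    ]

for stmt in scoped_setting("enable_nestloop", "off", "SELECT * FROM file_use;"):
    print(stmt)
```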

If you post a link to the output from EXPLAIN, I could probably advise you the right way to handle this query.


Thanks!! Also, here's EXPLAIN, examples with timings, and the relevant part of the database schema: https://gist.github.com/williamstein/fb31e07d4057232bd3a3e78... (My email: wstein@sagemath.com)


It's a bit difficult to tell what exactly is going on. For future reference, EXPLAIN ANALYZE provides a lot more insight into the execution of the query. It tells you how much time was spent in each part of the plan, as well as how many rows were produced by each part.

From what I can tell, this query is getting the 100 last-edited files of projects a user is part of. The way it is currently executing is by iterating through the most recently edited files, checking whether the user belongs to the project of each file, and repeating until it finds 100 files of projects the user belongs to. Since the query returns no results, I'm guessing you are running the query for a user that is not a part of any project, or only of empty projects. This means the query is looking up the project of every single file, only to find that none of them belong to a project the user is part of. You can check this by running EXPLAIN ANALYZE.

I'm not sure, since you didn't post the EXPLAIN of the query with enable_nestloop = off, but here's what I think is happening. You are getting a merge join between the projects table and the file_use table with the file_use_project_id_idx. If this is correct, this means Postgres first scans through all of the projects and finds the ones the user belongs to. Then it looks up all of the files that are part of one of those projects. Then it sorts those files by the time they were last edited and takes the top 100. I'm not sure if that is what's exactly happening, but I'm sure something similar to it is. You can check how accurate my guess is by running EXPLAIN/EXPLAIN ANALYZE.

The first thing I would try is creating a GIN index on the users field which can be done with the following:

  CREATE INDEX ON projects USING GIN (users jsonb_ops);
What I would expect to see is a nested loop join between projects and file_use. The query should use the GIN index to find all projects the user belongs to, then use the file_use_project_id_idx to get the files for each of those projects, then sort the files by the last time they were edited and take the top 100.


Another one that I have used in multiple of my projects: https://github.com/mattupstate/overholt


This is so cool. I am actually planning to start working on something similar, but for the wet-shaving market. Basically, I was trying to set up something where users can choose their aftershave pairing based on scent details.

Do you have plans to open source your algorithm?


Seems like you used React for the project above. Did you build a view/controller framework in-house, or use something more common?

