I'd rather have a solid codebase that's easy to maintain and has no quick hacks. Better to take the time to do it correctly than to ship features that sometimes don't work.
PostgreSQL didn't become PostgreSQL by doing it the MySQL way.
I don't think there's any contention at this point that it's a useful feature to have in the database. There just seems to be some disagreement on the finer points of the implementation. It'll get worked out, and Postgres will make it through another release with or without upsert.
I'm definitely not advocating rushing features in just for the sake of it; what makes it unfortunate is that much of the reason these features are missing wasn't poor implementations of them.
I used postgres for the first time last week, and was wondering, "WHERE IS UPSERT?" Ended up going with:

UPDATE table SET field='C', field2='Z' WHERE id=3;
INSERT INTO table (id, field, field2)
SELECT 3, 'C', 'Z'
WHERE NOT EXISTS (SELECT 1 FROM table WHERE id=3);

(as seen on http://stackoverflow.com/questions/1109061/insert-on-duplica... - although not the chosen answer). Anyone come up with better or their own solutions?
I built the pgscript version of the same upsert example and use it as a stored function. I'm no pgsql pro, so sorry if this is total shit. Works for what I needed. This is controlling a list of access rules to different services for doling out access at the controller level globally, or down to an individual action inside that controller. If I rewrote this on newer PG, I'd store the permissions text as a JSON datatype.
CREATE OR REPLACE FUNCTION merge_privileges(key integer, data_controller text, data_permissions text)
RETURNS void AS
$BODY$
BEGIN
    LOOP
        -- first try to update the key
        UPDATE privileges
           SET controller = data_controller, permissions = data_permissions
         WHERE user_id = key AND controller = data_controller;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert the key;
        -- if someone else inserts the same key concurrently,
        -- we could get a unique-key failure
        BEGIN
            INSERT INTO privileges (user_id, controller, permissions)
            VALUES (key, data_controller, data_permissions);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, and loop to try the UPDATE again
        END;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
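For reference, a hypothetical call — the user id and values here are made up, just matching the function's signature:

SELECT merge_privileges(42, 'reports', 'read,write');

The function returns void either way; the loop guarantees that by the time it returns, the row exists with the given permissions.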
I prefer to use a CTE in postgres to do upserts. This is an incomplete example, but you get the general idea:
WITH upsert AS (
    UPDATE some_table
       SET attrib = 'foo'
     WHERE id = 123
    RETURNING *
)
INSERT INTO some_table (id, attrib)
SELECT 123 AS id, 'foo' AS attrib
WHERE 123 NOT IN (SELECT id FROM upsert);
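A parameterized sketch of the same pattern, assuming a prepared statement with $1/$2 placeholders. Worth hedging: unlike the retry loop in the stored-function approach, this is not fully race-free — a concurrent insert between the CTE and the INSERT can still raise a unique violation, so retry logic belongs in the caller:

WITH upsert AS (
    UPDATE some_table
       SET attrib = $2
     WHERE id = $1
    RETURNING *
)
INSERT INTO some_table (id, attrib)
SELECT $1, $2
WHERE NOT EXISTS (SELECT 1 FROM upsert);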
Nope, that's exactly what I use when I really need it, generalized to simple table/PK schemes.[1]
At first I wanted to try and do it within PL/pgSQL (e.g., from your SO link), but I couldn't figure out how to generalize it like I could with the UPDATE ... INSERT WHERE. (I haven't really studied PL/pgSQL too hard yet.)
I wanted to update a record if it existed, and insert otherwise. So if a name is in the database, update the number of times I've seen it; if the name does not exist, insert it. I was keeping track of the number of times each name is seen.
Without knowing much about your actual application (i.e. treat the following with utmost skepticism!), my instinct would be to separate the fact of observing a name from the fact of the observation count. I would keep a log table and a summary table like this:
CREATE TABLE namelog (
name TEXT
);
CREATE TABLE namesummary (
name TEXT,
seen int
);
Then each time you see a name, log it:
INSERT INTO namelog VALUES (%s)
Then on a regular basis, run a housekeeping/summary process like this:
CREATE TEMPORARY TABLE prevsummary AS
SELECT * FROM namesummary;
TRUNCATE namesummary;
INSERT INTO namesummary
SELECT coalesce(a.name, b.name) as name,
coalesce(a.seen,0) + coalesce(b.seen, 0) as seen
FROM
prevsummary a
FULL OUTER JOIN
(SELECT name, count(*) as seen
FROM namelog
GROUP BY name) b
ON a.name=b.name
;
DELETE FROM namelog;
This of course is grossly space-inefficient, but making the logging process INSERT-only should enable a great degree of concurrency - and you can tune the summary process to run as often as necessary to keep memory/disk usage within your available resources.
If concurrency isn't an issue, then your initial solution is probably fine.
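One caveat on the housekeeping pass above (a sketch, assuming the tables as defined earlier): under the default READ COMMITTED isolation, each statement gets its own snapshot, so names logged while the summary runs could be removed by the final DELETE without ever being counted. Running the whole pass in one REPEATABLE READ transaction pins a single snapshot for all statements:

BEGIN ISOLATION LEVEL REPEATABLE READ;

CREATE TEMPORARY TABLE prevsummary AS
    SELECT * FROM namesummary;
TRUNCATE namesummary;
INSERT INTO namesummary
    SELECT coalesce(a.name, b.name) AS name,
           coalesce(a.seen, 0) + coalesce(b.seen, 0) AS seen
      FROM prevsummary a
      FULL OUTER JOIN
           (SELECT name, count(*) AS seen
              FROM namelog
             GROUP BY name) b
        ON a.name = b.name;
-- this DELETE now only removes rows visible to the transaction
-- snapshot, i.e. exactly the rows the join above counted
DELETE FROM namelog;

COMMIT;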
Very nice approach, I didn't think of trying to log names with only INSERT. I see how it handles concurrency. My solution is likely to fail if my database gets bombarded... My app was supposed to be a simple one I made for a few friends and also for me to learn a few things in node / postgres.
For more detail, here is my initial table:
CREATE TABLE names(name_id serial, name text, count integer)
I was planning to access the name count a good number of times, so I went with a simple SELECT from names. I don't necessarily need "on demand" counts, so having the summary process run every 5 or 10 minutes would be better!
I didn't know live data migration from one database version to another had a name, but I have been waiting for it for years. (Since 7.3, in fact.) And if I'm interpreting this right, that feature could also allow for a fairly straightforward way to move data from a running instance to a new one.
Perhaps not yet in the first release, but eventually that too could work.
It involves bouncing the master, but it doesn't involve extended downtime. Or is that not what you meant?
However, the 'logical log streaming replication' that is being implemented looks even better. A small but useful thing is that it works at database granularity, rather than having to replicate the whole installation. The huge thing is that because it works with logical rather than physical representations of changes, it allows replication into a database which is also being updated locally - slaves can be read/write rather than read-only. That itself is not immediately exciting, but the logical consequence of this is that two databases can both accept changes and replicate to each other - 'bi-directional replication', or what in MySQL is called 'multi-master replication'. The holy grail of database clustering!
Thank you for pointing this out, I would have missed it. In fact, thank you everyone who has answered with helpful technical details.
I have a couple of use-cases in mind where logical streaming replication could solve a number of problems. A personal project would benefit from a database-level setup due to easier data migration, a professional one would benefit from bi-directionality and logically straightforward clustering. (I also have some future designs in mind where replicating only subsets of a database would be VERY useful.)
Owing thanks to the replies, I now have some extra reading to do. :)
Well, let's be clear about what's being worked on. The logical replication framework as I understand it is a framework for building logical replication systems, not something for the end user/administrator. I.e. it is there to provide some centralized logic shared between implementations like Bucardo and Slony, rather than be usable directly.
Logical replication, if you need it, is not some far-off feature from a sysadmin's perspective but something where there are currently at least three major implementations of it to date, namely Slony, Bucardo, and Londiste. Take your pick, set it up, and use it. These three have very different focuses and design decisions associated with them and so what's good for one environment may not be good for another.
So instead of continuing to wait for it, get out and try these solutions.
Just to note, the two main uses of logical replication (as opposed to physical replication, which is what streaming replication offers) are:
1. Live migrations between versions.
2. Replicating only a subset of a database (for example, into a system in a DMZ).
Every major implementation currently supports both of these approaches, but supporting #2 makes using it for #1 significantly harder. It might be nice to see a framework for #1 be built on top of an existing logical replication framework.
The plans for logical replication and the amount of work put into it seem serious, so I would expect logical replication to land sooner or later (my guess is in 9.5, but I may be optimistic here).
Several groundwork patches have already been committed, with a bunch of other patches being worked on.
It doesn't matter what features you add to postgres, it will never be less appropriate for a project than another database that has some magic feature.
The reason people choose another database is because they favor an architecture where all the things that should be done in the database get done in the middleware. Features are irrelevant since the goal is do everything in the middleware anyway.
Once a person converts to the idea that it's faster, safer and more reliable to use the database for what it was designed to do, then they have left the jungle and found the road. And all roads lead to postgres.
I really don't care one iota about any of these features. They are all obscure and I avoid obscure features in deference to the guy that has to read my code next (mostly me).
> JSON in postgres is just a container type isn't it?
At the moment JSON is stored in a text field so accessing a JSON field involves parsing the entire text value. If instead the JSON is stored in a parsed binary format then you can more efficiently access individual fields.
If you're only accessing a small set of known fields you can work around the issue by using plv8 to create function indexes on the fields that you'll be using. This doesn't work for arbitrary expressions though. If you want to filter a WHERE clause based on a non-indexed field the entire JSON text will need to be parsed.
Binary JSON storage makes all of this much faster with no change on the user's side. It's all transparent and just faster!
In theory it could also reduce storage space for tuples. Duplicate field names need not be repeated. I don't think it's in scope in the Postgres JSON improvements but it's a possibility.
> Why is everyone expecting it to become document storage?
Document storage in a relational database is really useful and there are plenty of use cases for it.
The standard example I use for JSON (or hstore) usage for Postgres is an audit table. You'd have all the usual audit fields (who/what/when) but you'd also have a "detail" field with event specific details. Using hstore or JSON for this is perfect. Improved support for JSON makes it much easier to provide event specific search.
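As a concrete sketch of that audit-table idea (table and column names here are made up for illustration):

CREATE TABLE audit_log (
    id      bigserial   PRIMARY KEY,
    actor   text        NOT NULL,              -- who
    action  text        NOT NULL,              -- what
    at      timestamptz NOT NULL DEFAULT now(),-- when
    detail  json                               -- event-specific payload
);

INSERT INTO audit_log (actor, action, detail)
VALUES ('alice', 'login', '{"ip": "10.0.0.5", "mfa": true}');

-- with the 9.3 JSON operators, event-specific search is straightforward:
SELECT * FROM audit_log WHERE detail->>'ip' = '10.0.0.5';

The fixed who/what/when columns stay queryable and indexable as usual, while each action can carry whatever detail shape it needs.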
> If you're only accessing a small set of known fields you can work around the issue by using plv8 to create function indexes on the fields that you'll be using.
You don't need plv8 to create an index on a JSON field, it's supported natively.
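For example, assuming a hypothetical docs table with a json column named doc, a plain expression index on the native 9.3 operator looks like:

CREATE INDEX docs_author_idx ON docs ((doc->>'author'));

-- a query using the same expression can use the index:
SELECT * FROM docs WHERE doc->>'author' = 'alice';

The ->> operator extracts the field as text, so it can be indexed directly without a cast.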
Yes my mistake on that. I'm mixing up 9.2 (where you had to do use plv8) and 9.3 (which has the native operator). The rest of the comment still stands though, a parsed representation makes the operator faster and more efficient.
Isn't there an in-place update of an existing database without dump and restore now, for quite a few versions (http://www.postgresql.org/docs/9.3/static/pgupgrade.html)? A tool which keeps the relation files but builds new system catalogs around them? I had the impression that this tool is very fast, but a bit risky.
The linked article makes it sound like there is a need for local replication in order to do an update without (much) downtime. Is this just a (less risky!) alternative to the available in place update?
Last time I proposed PostgreSQL over MySQL, what prevented adoption was lack of PostgreSQL support on Amazon's RDS. The time before that, it was that we lacked any DBA who knew PostgreSQL well enough.
I find lack of support for one specific command a very poor reason not to adopt an exceedingly well engineered database. It's like dismissing Oracle because of how ugly your DDL looks using VARCHAR2 types.
In contrast, having to pull products from production just to add indexes to large tables in MySQL was a huge nuisance.
There might be stored procedures etc., but not being an expert means I'd be concerned about performance and correctness. And performance is exactly why I want upsert: to save a lot of back and forth when inputting a lot of data. Also, since it isn't standard, things like bulk-upload Ruby gems end up not supporting faster methods of upload, because they can't depend on stored procedures in any easy way that I can tell.
I think gleenn is talking about the abstraction known as "upsert". Namely, there exists some syntax for expressing "either insert if it doesn't exist or update if it does exist; determine existence by primary key."
This process can be achieved in Postgres, but I'm not aware of any simple abstraction for it. But I could be wrong. Does there exist a stored procedure that can be generalized to the point of upsert abstractions in other relational databases? (In this thread, I see a stored procedure for a specific table/set of columns.)
Yes, for me this is a showstopper. All MySQL projects I've worked on have REPLACE INTO and/or INSERT … ON DUPLICATE.
Postgres' equivalents of UPSERT that I've found either only work correctly with single-row inserts or are inefficient and complicated.
It's really hard to sell PostgreSQL to developers on a MySQL project when a simple one-line query explodes into a complicated stored procedure for apparently no good reason.
> Yes, for me this is a showstopper. All MySQL projects I've worked on have REPLACE INTO and/or INSERT … ON DUPLICATE.
Curious. What is your use case?
From personal experience, I've used "ON DUPLICATE", but we ended up dumping it and making our API better/more secure. This was even before we switched to Postgres.