I'd rather have a solid codebase that's easy to maintain and has no quick hacks. Better to take the time to do it correctly than to ship features that sometimes don't work.
PostgreSQL didn't become PostgreSQL by doing it the MySQL way.
I don't think there's any contention at this point that it's a useful feature to have in the database. There just seems to be some disagreement on the finer points of the implementation. It'll get worked out, and Postgres will make it through another release with or without upsert.
I'm definitely not advocating rushing features in just for the sake of it; what makes it unfortunate is that much of the reason these features are missing wasn't poor implementations of them.
I used postgres for the first time last week, and was wondering, "WHERE IS UPSERT?" Ended up going with:

UPDATE table SET field='C', field2='Z' WHERE id=3;
INSERT INTO table (id, field, field2)
SELECT 3, 'C', 'Z'
WHERE NOT EXISTS (SELECT 1 FROM table WHERE id=3);

(as seen on http://stackoverflow.com/questions/1109061/insert-on-duplica... - although not the chosen answer). Anyone come up with better or their own solutions?
I built the pgscript version of the same upsert example and use it as a stored function. I'm no pgsql pro, so sorry if this is total shit. Works for what I needed. This is controlling a list of access rules to different services for doling out access at the controller level globally, or down to an individual action inside that controller. If I rewrote this on newer PG, I'd store the permissions text as a JSON datatype.
CREATE OR REPLACE FUNCTION merge_privileges(key integer, data_controller text, data_permissions text)
RETURNS void AS
$BODY$
BEGIN
    LOOP
        -- first try to update the key
        UPDATE privileges
           SET controller = data_controller, permissions = data_permissions
         WHERE user_id = key AND controller = data_controller;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert the key;
        -- if someone else inserts the same key concurrently,
        -- we could get a unique-key failure
        BEGIN
            INSERT INTO privileges (user_id, controller, permissions)
            VALUES (key, data_controller, data_permissions);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, and loop to try the UPDATE again
        END;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
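For reference, a hypothetical call — the user id and values here are made up, just matching the function's signature:

SELECT merge_privileges(42, 'reports', 'read,write');

The function returns void either way; the loop guarantees that by the time it returns, the row exists with the given permissions.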
I prefer to use a CTE in postgres to do upserts. This is an incomplete example, but you get the general idea:
WITH upsert AS (
    UPDATE some_table
       SET attrib = 'foo'
     WHERE id = 123
    RETURNING *
)
INSERT INTO some_table (id, attrib)
SELECT 123 AS id, 'foo' AS attrib
WHERE 123 NOT IN (SELECT id FROM upsert);
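A parameterized sketch of the same pattern, assuming a prepared statement with $1/$2 placeholders. Worth hedging: unlike the retry loop in the stored-function approach, this is not fully race-free — a concurrent insert between the CTE and the INSERT can still raise a unique violation, so retry logic belongs in the caller:

WITH upsert AS (
    UPDATE some_table
       SET attrib = $2
     WHERE id = $1
    RETURNING *
)
INSERT INTO some_table (id, attrib)
SELECT $1, $2
WHERE NOT EXISTS (SELECT 1 FROM upsert);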
Nope, that's exactly what I use when I really need it, generalized to simple table/PK schemes.[1]
At first I wanted to try and do it within PL/pgSQL (e.g., from your SO link), but I couldn't figure out how to generalize it like I could with the UPDATE ... INSERT WHERE. (I haven't really studied PL/pgSQL too hard yet.)
I wanted to update a record if it existed, and insert otherwise. So if a name is in the database, update the number of times I've seen it; if the name does not exist, insert it. I was keeping track of the number of times each name is seen.
Without knowing much about your actual application (i.e. treat the following with utmost skepticism!), my instinct would be to separate the fact of observing a name from the fact of the observation count. I would keep a log table and a summary table like this:
CREATE TABLE namelog (
name TEXT
);
CREATE TABLE namesummary (
name TEXT,
seen int
);
Then each time you see a name, log it:
INSERT INTO namelog VALUES (%s)
Then on a regular basis, run a housekeeping/summary process like this:
CREATE TEMPORARY TABLE prevsummary AS
SELECT * FROM namesummary;
TRUNCATE namesummary;
INSERT INTO namesummary
SELECT coalesce(a.name, b.name) as name,
coalesce(a.seen,0) + coalesce(b.seen, 0) as seen
FROM
prevsummary a
FULL OUTER JOIN
(SELECT name, count(*) as seen
FROM namelog
GROUP BY name) b
ON a.name=b.name
;
DELETE FROM namelog;
This of course is grossly space-inefficient, but making the logging process INSERT-only should enable a great degree of concurrency - and you can tune the summary process to run as often as necessary to keep memory/disk usage within your available resources.
If concurrency isn't an issue, then your initial solution is probably fine.
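One caveat on the housekeeping pass above (a sketch, assuming the tables as defined earlier): under the default READ COMMITTED isolation, each statement gets its own snapshot, so names logged while the summary runs could be removed by the final DELETE without ever being counted. Running the whole pass in one REPEATABLE READ transaction pins a single snapshot for all statements:

BEGIN ISOLATION LEVEL REPEATABLE READ;

CREATE TEMPORARY TABLE prevsummary AS
    SELECT * FROM namesummary;
TRUNCATE namesummary;
INSERT INTO namesummary
    SELECT coalesce(a.name, b.name) AS name,
           coalesce(a.seen, 0) + coalesce(b.seen, 0) AS seen
      FROM prevsummary a
      FULL OUTER JOIN
           (SELECT name, count(*) AS seen
              FROM namelog
             GROUP BY name) b
        ON a.name = b.name;
-- this DELETE now only removes rows visible to the transaction
-- snapshot, i.e. exactly the rows the join above counted
DELETE FROM namelog;

COMMIT;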
Very nice approach, I didn't think of trying to log names with only INSERT. I see how it handles concurrency. My solution is likely to fail if my database gets bombarded... My app was supposed to be a simple one I made for a few friends and also for me to learn a few things in node / postgres.
For more detail, here is my initial table:
CREATE TABLE names(name_id serial, name text, count integer)
I was planning to access the name count a good number of times, so I went with a simple SELECT from names. I don't necessarily need "on demand" counts, so having the summary process run every 5 or 10 minutes would be better!
I didn't know live data migration from one database version to another had a name, but I have been waiting for it for years. (Since 7.3, in fact.) And if I'm interpreting this right, that feature could also allow for a fairly straightforward way to move data from a running instance to a new one.
Perhaps not yet in the first release, but eventually that too could work.
It involves bouncing the master, but it doesn't involve extended downtime. Or is that not what you meant?
However, the 'logical log streaming replication' that is being implemented looks even better. A small but useful thing is that it works at database granularity, rather than having to replicate the whole installation. The huge thing is that because it works with logical rather than physical representations of changes, it allows replication into a database which is also being updated locally - slaves can be read/write rather than read-only. That itself is not immediately exciting, but the logical consequence of this is that two databases can both accept changes and replicate to each other - 'bi-directional replication', or what in MySQL is called 'multi-master replication'. The holy grail of database clustering!
Thank you for pointing this out, I would have missed it. In fact, thank you everyone who has answered with helpful technical details.
I have a couple of use-cases in mind where logical streaming replication could solve a number of problems. A personal project would benefit from a database-level setup due to easier data migration, a professional one would benefit from bi-directionality and logically straightforward clustering. (I also have some future designs in mind where replicating only subsets of a database would be VERY useful.)
Owing thanks to the replies, I now have some extra reading to do. :)
Well, let's be clear about what's being worked on. The logical replication framework as I understand it is a framework for building logical replication systems, not something for the end user/administrator. I.e. it is there to provide some centralized logic shared between implementations like Bucardo and Slony, rather than be usable directly.
Logical replication, if you need it, is not some far-off feature from a sysadmin's perspective but something where there are currently at least three major implementations of it to date, namely Slony, Bucardo, and Londiste. Take your pick, set it up, and use it. These three have very different focuses and design decisions associated with them and so what's good for one environment may not be good for another.
So instead of continuing to wait for it, get out and try these solutions.
Just to note, the two main uses of logical replication (as opposed to physical replication, which is what streaming replication offers) are:
1. Live migrations between versions.
2. Replicating only a subset of a database (for example, into a system in a DMZ).
Every major implementation currently supports both of these approaches, but supporting #2 makes using it for #1 significantly harder. It might be nice to see a framework for #1 be built on top of an existing logical replication framework.
The plans for logical replication and the amount of work put into it seem serious, so I would expect logical replication to land sooner or later (my guess is in 9.5, but I may be optimistic here).
Several groundwork patches have already been committed, with a bunch of other patches being worked on.
It doesn't matter what features you add to postgres, it will never be less appropriate for a project than another database that has some magic feature.
The reason people choose another database is because they favor an architecture where all the things that should be done in the database get done in the middleware. Features are irrelevant since the goal is do everything in the middleware anyway.
Once a person converts to the idea that it's faster, safer and more reliable to use the database for what it was designed to do, then they have left the jungle and found the road. And all roads lead to postgres.
I really don't care one iota about any of these features. They are all obscure and I avoid obscure features in deference to the guy that has to read my code next (mostly me).
> JSON in postgres is just a container type isn't it?
At the moment JSON is stored in a text field so accessing a JSON field involves parsing the entire text value. If instead the JSON is stored in a parsed binary format then you can more efficiently access individual fields.
If you're only accessing a small set of known fields you can work around the issue by using plv8 to create function indexes on the fields that you'll be using. This doesn't work for arbitrary expressions though. If you want to filter a WHERE clause based on a non-indexed field the entire JSON text will need to be parsed.
Binary JSON storage makes all of this much faster with no change on the user's side. It's all transparent and just faster!
In theory it could also reduce storage space for tuples. Duplicate field names need not be repeated. I don't think it's in scope in the Postgres JSON improvements but it's a possibility.
> Why is everyone expecting it to become document storage?
Document storage in a relational database is really useful and there are plenty of use cases for it.
The standard example I use for JSON (or hstore) usage for Postgres is an audit table. You'd have all the usual audit fields (who/what/when) but you'd also have a "detail" field with event specific details. Using hstore or JSON for this is perfect. Improved support for JSON makes it much easier to provide event specific search.
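As a concrete sketch of that audit-table idea (table and column names here are made up for illustration):

CREATE TABLE audit_log (
    id      bigserial   PRIMARY KEY,
    actor   text        NOT NULL,              -- who
    action  text        NOT NULL,              -- what
    at      timestamptz NOT NULL DEFAULT now(),-- when
    detail  json                               -- event-specific payload
);

INSERT INTO audit_log (actor, action, detail)
VALUES ('alice', 'login', '{"ip": "10.0.0.5", "mfa": true}');

-- with the 9.3 JSON operators, event-specific search is straightforward:
SELECT * FROM audit_log WHERE detail->>'ip' = '10.0.0.5';

The fixed who/what/when columns stay queryable and indexable as usual, while each action can carry whatever detail shape it needs.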
> If you're only accessing a small set of known fields you can work around the issue by using plv8 to create function indexes on the fields that you'll be using.
You don't need plv8 to create an index on a JSON field, it's supported natively.
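For example, assuming a hypothetical docs table with a json column named doc, a plain expression index on the native 9.3 operator looks like:

CREATE INDEX docs_author_idx ON docs ((doc->>'author'));

-- a query using the same expression can use the index:
SELECT * FROM docs WHERE doc->>'author' = 'alice';

The ->> operator extracts the field as text, so it can be indexed directly without a cast.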
Yes my mistake on that. I'm mixing up 9.2 (where you had to do use plv8) and 9.3 (which has the native operator). The rest of the comment still stands though, a parsed representation makes the operator faster and more efficient.
Isn't there an in-place update of an existing database without dump and restore now, for quite a few versions (http://www.postgresql.org/docs/9.3/static/pgupgrade.html)? A tool which keeps the relation files but builds new system catalogs around them? I had the impression that this tool is very fast, but a bit risky.
The linked article makes it sound like there is a need for local replication in order to do an update without (much) downtime. Is this just a (less risky!) alternative to the available in place update?
Last time I proposed PostgreSQL over MySQL, what prevented adoption was lack of PostgreSQL support on Amazon's RDS. The time before that, it was that we lacked any DBA who knew PostgreSQL well enough.
I find lack of support for one specific command a very poor reason not to adopt an exceedingly well engineered database. It's like dismissing Oracle because of how ugly your DDL looks using VARCHAR2 types.
In contrast, having to pull products from production just to add indexes to large tables in MySQL was a huge nuisance.
There might be stored procedures etc., but not being an expert means I'd be concerned about performance and correctness. And performance is exactly why I want upsert: to save a lot of back and forth when inputting a lot of data. Also, since it isn't standard, things like bulk-upload Ruby gems end up not supporting faster methods of upload, because they can't depend on stored procedures in any easy way that I can tell.
I think gleenn is talking about the abstraction known as "upsert". Namely, there exists some syntax for expressing "either insert if it doesn't exist or update if it does exist; determine existence by primary key."
This process can be achieved in Postgres, but I'm not aware of any simple abstraction for it. But I could be wrong. Does there exist a stored procedure that can be generalized to the point of upsert abstractions in other relational databases? (In this thread, I see a stored procedure for a specific table/set of columns.)
Yes, for me this is a showstopper. All MySQL projects I've worked on have REPLACE INTO and/or INSERT … ON DUPLICATE.
Postgres' equivalents of UPSERT that I've found either only work correctly with single-row inserts or are inefficient and complicated.
It's really hard to sell PostgreSQL to developers on a MySQL project when a simple one-line query explodes into a complicated stored procedure for apparently no good reason.
> Yes, for me this is a showstopper. All MySQL projects I've worked on have REPLACE INTO and/or INSERT … ON DUPLICATE.
Curious. What is your use case?
From personal experience, I've used "ON DUPLICATE", but we ended up dumping it and making our API better/more secure. This was even before we switched to Postgres.