The whole bug thread is good information, but: if application developers actually follow approach #3 en masse, won't that massively slow down the system? I see some cargo-cult potential in fsync-after-every-write here. The emacs example may be correct, but it has a couple steps that require extra thought depending on the context of the rewrite. Where data consistency and security are less necessary, example #2 seems just fine, assuming rename() is atomic and a power failure would just leave the old contents in the file.
For Gnome and KDE, Ted suggests using BDB or an idealized version of SQLite instead of the current dot-directories, but I don't imagine that change being implemented any time soon -- Firefox still has some lingering problems from that transition.
"example #2 seems just fine, assuming rename() is atomic and a power failure would just leave the old contents in the file."
Unfortunately, the 'bug' is that this does not happen. Instead, the truncate hits disk before the new data is written, so a power loss leaves the file empty rather than merely out-of-date. ext4 behaves differently from ext3 here, which is why this has only recently become an issue.
To my untrained eye, regardless of what the POSIX standard 'allows', this is a bug that should be fixed. On the bright side, it looks like there are patches ready to be applied that will change the behaviour to match ext3, which is as you describe. If this happens, #2 would remain a relatively safe choice.
> I see some cargo-cult potential in fsync-after-every-write here.
Well, we currently effectively have cargo-cult no-fsync, in that application writers who don't consider the matter get the fast-but-risky behaviour.
I think I'd be happy for naive application writers to get safe-and-slow as a default, learning that they can get fast-but-risky over time (as they, hopefully, become aware of when to use the two approaches).
That said, what that would really achieve is an overall impression that "writing files under Linux is slow", so perhaps the current situation is 'best'.
Another option: use mmap() and rewrite it in place, using ftruncate() if necessary to grow/shrink the file, and msync() to flush the changes.
This is also risky because it could leave the file in an inconsistent, partially-modified state. However, in practice it seems to work really well, and Google turns up a striking lack of articles discussing the risks of inconsistency due to use of mmap.
I'd like to learn more about the risks of this approach, since I'm using it to selectively rewrite portions of a very important file in a project of mine. I'm pretty sure that my usage pattern is safe. The writes are small enough that they nearly always affect only one disk block and never touch more than two, and I'm careful to msync() at the right places (actually mmap.mmap.flush(), since I'm using Python), but I'm interested in learning about potential issues. This file is too large to make copying it convenient, but I'll do that if necessary.
"Another option: use mmap() and rewrite it in place, using ftruncate() if necessary to grow/shrink the file, and msync() to flush the changes."
This is a fine option, but for the purpose of this bug I don't think there is any advantage over using stdio (with seeks if necessary) and calling fsync() after writing. It's a great technique, but functionally equivalent to just saying "avoid O_TRUNC".
Unless you have some nifty way of upgrading from a copy-on-write mmap(MAP_PRIVATE) to something that gets atomically written to disk?
You can always write a wrapper to start a transaction, extract the file from the database, exec $EDITOR on it, replace the database data with the file, and commit the transaction. That is very safe, and very easy to do. (And, you can make the txn apply to multiple files, which is quite useful.)
We use this technique in the command line utility for KiokuDB.
For approach #2, can you lose the file completely or only the new version? In many cases it would be acceptable if the old version was still there after a crash (ACI semantics).
As I understand it, you can lose both versions in #2. The rename drops the refcount of the old file to 0, freeing its inode and associated file blocks. If you're lucky, they haven't been overwritten and you can still find them -- though not easily from an app. On the other hand, the new file might not have been flushed to disk yet.
I suspect this could be solved fs-side using a journal that doesn't just store metadata (as most do) but also file content. Slow, though, as files are effectively written to disk twice.
Ugh, that is bad if true. A filesystem shouldn't commit a rename to disk before the file being renamed; that is common sense causality. Especially since (AFAIK) #2 is totally safe under ZFS, so Linux is behind yet again.
Guilty of #2. I'll spend part of this afternoon putting in fsync() calls.
Pseudo related: I wish I could have anonymous temporary files and associate them with a name later. Temporary name handling is just bugs waiting to happen. Even if you do it securely, how many programmers have a way of cleaning up spurious debris that might be left after an inelegant termination?
You create secure temp files with mkstemp(). It's built into the standard C library.
Mad respect to Ts'o, but most Unix apps --- regardless of what he thinks of the developers --- don't do this fsync dance. There isn't an epidemic of corrupted files because of it. And there have been notable cases where syncing has caused calamitous performance problems; for instance, look up what the Apple developers did to SQLite.
Even more approaches to rewriting: http://bitworking.org/news/390/text-editor-saving-routines