The whole bug thread is good information, but: if application developers actually follow approach #3 en masse, won't that massively slow down the system? I see some cargo-cult potential in fsync-after-every-write here. The emacs example may be correct, but it has a couple steps that require extra thought depending on the context of the rewrite. Where data consistency and security are less necessary, example #2 seems just fine, assuming rename() is atomic and a power failure would just leave the old contents in the file.
For Gnome and KDE, Ted suggests using BDB or an idealized version of SQLite instead of the current dot-directories, but I don't imagine that change being implemented any time soon -- Firefox still has some lingering problems from that transition.
"example #2 seems just fine, assuming rename() is atomic and a power failure would just leave the old contents in the file."
Unfortunately, the 'bug' is that this does not happen. Instead, the truncate hits disk before the new data is written, so a power loss leaves the file empty rather than merely out-of-date. ext4 behaves differently from ext3 here, which is why this has only recently become an issue.
To my untrained eye, regardless of what the POSIX standard 'allows', this is a bug that should be fixed. On the bright side, it looks like there are patches ready to be applied that will change the behaviour to match ext3, which is as you describe. If this happens, #2 would remain a relatively safe choice.
> I see some cargo-cult potential in fsync-after-every-write here.
Well, we currently effectively have cargo-cult no-fsync, in that application writers who don't consider the matter get the fast-but-risky behaviour.
I think I'd be happy for naive application writers to get safe-and-slow as a default, learning that they can get fast-but-risky over time (as they, hopefully, become aware of when to use the two approaches).
That said, what that would really achieve is an overall impression that "writing files under Linux is slow", so perhaps the current situation is 'best'.
Another option: use mmap() and rewrite it in place, using ftruncate() if necessary to grow/shrink the file, and msync() to flush the changes.
This is also risky because it could leave the file in an inconsistent, partially-modified state. However, in practice it seems to work really well, and Google turns up a striking lack of articles discussing the risks of inconsistency due to use of mmap.
I'd like to learn more about the risks of this approach, since I'm using it to selectively rewrite portions of a very important file in a project of mine. I'm pretty sure that my usage pattern is safe. The writes are small enough that they nearly always affect only one disk block and never touch more than two, and I'm careful to msync() at the right places (actually mmap.mmap.flush(), since I'm using Python), but I'm interested in learning about potential issues. This file is too large to make copying it convenient, but I'll do that if necessary.
"Another option: use mmap() and rewrite it in place, using ftruncate() if necessary to grow/shrink the file, and msync() to flush the changes."
This is a fine option, but for the purpose of this bug I don't think there is any advantage over using stdio (with seeks if necessary) and calling fsync() after writing. It's a great technique, but functionally equivalent to just saying "avoid O_TRUNC".
Unless you have some nifty way of upgrading from a copy-on-write mmap(MAP_PRIVATE) to something that gets atomically written to disk?
You can always write a wrapper to start a transaction, extract the file from the database, exec $EDITOR on it, replace the database data with the file, and commit the transaction. That is very safe, and very easy to do. (And, you can make the txn apply to multiple files, which is quite useful.)
We use this technique in the command line utility for KiokuDB.
For approach #2, can you lose the file completely or only the new version? In many cases it would be acceptable if the old version was still there after a crash (ACI semantics).
As I understand it, you can lose both versions in #2. The rename drops the refcount of the old file to 0, freeing its inode and associated file blocks. If you're lucky, they haven't been overwritten and you can still find them -- though not easily from an app. On the other hand, the new file might not have been flushed to disk yet.
I suspect this could be solved fs-side using a journal that doesn't just store metadata (as most do) but also file content. Slow, though, as files are effectively written to disk twice.
Ugh, that is bad if true. A filesystem shouldn't commit a rename to disk before the file being renamed; that is common sense causality. Especially since (AFAIK) #2 is totally safe under ZFS, so Linux is behind yet again.
Guilty of #2. I'll spend part of this afternoon putting in fsync() calls.
Pseudo related: I wish I could have anonymous temporary files and associate them with a name later. Temporary name handling is just bugs waiting to happen. Even if you do it securely, how many programmers have a way of cleaning up spurious debris that might be left after an inelegant termination?
You create secure temp files with mkstemp(). It's built into the standard C library.
Mad respect to Ts'o, but most Unix apps --- regardless of what he thinks of the developers --- don't do this fsync dance. There isn't an epidemic of corrupted files because of it. And there have been notable cases where syncing has caused calamitous performance problems; for instance, look up what the Apple developers did to SQLite.
Even more approaches to rewriting: http://bitworking.org/news/390/text-editor-saving-routines