I would tentatively disagree that it raised the bar, and would describe it using my favorite MySQL/Postgres analogy: the two have fundamentally different philosophies but do roughly the same thing.
For example, if you want to do something software-RAID-ish, ZFS has the philosophy that it should be done at the filesystem layer, not as a virtual device like every other Linux filesystem, ever. It's not a new feature to be able to do RAID, but it's new to embed RAID into the filesystem layer itself. Linux-style virtual RAID devices don't care if you build a FAT32 on top of /dev/md0.
There are other examples of the same philosophy in ZFS. Everywhere else on Linux, if you want some manner of "volume manager" you simply use LVM; ZFS has its own interesting little volume manager, which is also what its snapshots hang off of.
It's exactly the same with encryption: every other implementation on Linux uses a loop device and your choice of algorithm, while ZFS shoves all of that inside the filesystem.
Another philosophical decision: every other Linux filesystem doesn't scrub, only fscks its metadata, so logically ZFS implements the exact opposite.
Although ZFS supporters are technically telling the truth when they run around saying only ZFS can provide software RAID, or only ZFS has a volume manager and ext2/3/4 does not, it's not relevant. I've had LVM and software RAID and all that for many years on existing Linux stuff.
One of the few true features ZFS provides is allowing ridiculously big filesystems. Which is cool.
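To put "ridiculously big" in perspective, here is a back-of-the-envelope look at the commonly cited theoretical ZFS limits (128-bit block pointers for the pool, a 2^64-byte single-file limit); the numbers are the usual quoted maxima, not anything buildable today:

```python
# Rough arithmetic on ZFS's commonly cited theoretical limits.
ZiB = 2 ** 70  # one zebibyte in bytes
EiB = 2 ** 60  # one exbibyte in bytes

max_pool_bytes = 2 ** 128   # 128-bit block pointers
max_file_bytes = 2 ** 64    # single-file limit

print(max_pool_bytes // ZiB)  # theoretical pool limit, in ZiB
print(max_file_bytes // EiB)  # single-file limit: 16 EiB
```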
It is mostly a philosophical difference between modularity and monolithic design, with pretty much everything else being modular, and ZFS being extremely monolithic.
In that way I don't think ZFS has prodded any innovation at all in any other filesystems, other than maybe btrfs, which I haven't been following because my data is too valuable to experiment on and filesystems aren't my thing. I don't see the iso9660 FS driver adding native volume management, snapshotting, software RAID, and encryption any time soon.
> For example, if you want to do something software-RAID-ish, ZFS has the philosophy that it should be done at the filesystem layer, not as a virtual device like every other Linux filesystem, ever.
Not exactly true. RAID and mirroring logically sit at the zpool layer, so anything on top of a given zpool has that zpool's RAID/mirror characteristics. That may be a filesystem, but it could also be a zvol. A zvol is analogous to your /dev/md0 block device: you could put a FAT32 filesystem on top of a zvol and still benefit from the underlying redundancy, parity, and checksumming features of ZFS.
Addendum: strictly speaking, redundancy (RAID or mirror) is configured per vdev, and a zpool is composed of one or more vdevs over which data is striped.
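The layering described above can be sketched as a toy data model (the class names and fields here are mine for illustration, not ZFS's actual structures): redundancy lives on the vdev, the pool stripes over vdevs, and every dataset created in the pool, filesystem or zvol, inherits that redundancy for free.

```python
# Toy model of vdev -> zpool -> dataset layering. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Vdev:
    disks: list
    redundancy: str          # e.g. "mirror", "raidz1"

@dataclass
class Zpool:
    vdevs: list              # data is striped across these
    datasets: dict = field(default_factory=dict)

    def create_dataset(self, name, kind):
        # Filesystems and zvols both sit on the same pool, so both
        # inherit whatever redundancy the pool's vdevs provide.
        self.datasets[name] = kind

pool = Zpool(vdevs=[Vdev(["sda", "sdb"], "mirror")])
pool.create_dataset("tank/home", "filesystem")
pool.create_dataset("tank/vol0", "zvol")  # could hold FAT32, like /dev/md0
```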
> ZFS shoves all that inside the filesystem. [...] I've had LVM and software RAID and all that for many years on existing linux stuff.
I suggest you study the design of ZFS and also NetApp's WAFL to understand why they are combining what have traditionally been separate layers into one larger system.
The short version is that this cross-cutting of layers enables substantial optimizations and new features that aren't possible when everything is kept to strict interfaces.
But your conclusion is basically what I wrote, that one design is monolithic and one design is modular.
Yes, in theory you could probably come up with a weird pathological scenario where a modular design is slower and a monolithic design is faster. But that usually doesn't happen.
Usually, turning a modular design into a monolithic design for a tiny performance gain turns into an epic disaster/mistake. Maybe the whole ZFS thing will be an exception. Probably not.
> Yes, in theory you could probably come up with a weird pathological scenario where a modular design is slower and a monolithic design is faster. But that usually doesn't happen. [...] Usually, turning a modular design into a monolithic design for a tiny performance gain turns into an epic disaster/mistake.
Well, think about this. Suppose you're running RAID-1 with two drives, and you've got some filesystem (maybe ext4, but that doesn't matter) running on top of that. You create one huge file, and then a little while later you delete it. And right after that, one of the disks dies, and you replace it.
In this case, your RAID layer doesn't know that most of the data written to the original drive is junk, and that the only really important bits are some inodes and directory entries consuming a few MB near the end of the disk. It has to re-mirror the entire drive from the original to the replacement before they are in sync again and you are fully protected. Even with modern drives, that leaves a large window of time in which you're not protected.
If, on the other hand, your RAID layer has a thicker interface to the filesystem than just a dumb block store, it can just mirror the little bit of metadata, and within seconds you're in sync again and fully protected.
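Some rough numbers for that unprotected window (drive size and throughput are assumptions picked for illustration, not measurements):

```python
# Rough comparison of the resync window: dumb block mirror vs.
# filesystem-aware resilver. All figures are illustrative assumptions.

drive_bytes = 4 * 10**12          # a 4 TB drive
rate = 150 * 10**6                # ~150 MB/s sequential rewrite

# Dumb block-level mirror: must copy every sector, used or not.
full_resync_s = drive_bytes / rate
print(full_resync_s / 3600)       # roughly 7.4 hours unprotected

# Filesystem-aware resilver: only live data needs copying; here, a
# few hundred MB of metadata left after the big file was deleted.
live_bytes = 500 * 10**6
aware_resync_s = live_bytes / rate
print(aware_resync_s)             # a few seconds
```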
That's just one example. There are many more. Go read the stories about people complaining about RAID-5 and RAID-6 performance.
I don't know much about RAID and other systems, but does anything other than ZFS read data from disk, determine that it is corrupt, fetch the same data from another disk, verify that its hash is good, and then automatically fix the corrupt copy? That seems like a killer feature to me.
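That read path can be sketched in a few lines, purely as an illustration of the idea (this is not ZFS code; ZFS stores the checksum in the parent block pointer rather than alongside the data, but the self-healing logic is the same shape):

```python
# Sketch of checksum-verified reads with self-healing across mirrors.
import hashlib

def sha(b):
    return hashlib.sha256(b).hexdigest()

good = b"important data"
checksum = sha(good)                      # checksum kept with the metadata

mirror = {"disk0": b"important d\x00ta",  # silently corrupted copy
          "disk1": good}

def read_with_self_heal(block_checksum, mirror):
    for disk, data in mirror.items():
        if sha(data) == block_checksum:
            # Found a verified copy: rewrite any sibling that fails.
            for d in mirror:
                if sha(mirror[d]) != block_checksum:
                    mirror[d] = data
            return data
    raise IOError("all copies corrupt")

data = read_with_self_heal(checksum, mirror)
```

The caller never sees the corruption: the read returns good data, and the bad copy on disk0 has been repaired as a side effect.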
Hmmm. I run NTFS and truecrypt on top of raidz zpool allocations (zvols, virtual block devices) all the time, block storage doled out over iSCSI from my ageing Nexenta box.
I do think you're too quickly dismissing the advantages of the filesystem being aware of the underlying block storage provider, to the point of being able to grow elastically from common storage rather than preallocating at partition time.
I'd much rather the whole md layer be replaced by zpool even if zfs itself is not used.