
Facebook runs their entire stack using Btrfs [0]. I would encourage anyone who is stuck in the "oh btrfs is so buggy and loses data" mindset (not helped by articles like this [1] that play off btrfs as some half-baked contraption, when it's really btrfs raid that needs a LOT more time to bake) to look into things and realize that large companies (openSUSE, Red Hat, Facebook) have poured a lot of time into getting it to work well.

I don't know about its multi-disk story (I do use ZFS for that personally), but for single-disk use it is great. You get so many of the ZFS benefits (snapshots, rollback, easily create and delete volumes, etc.) with MUCH lower memory usage (at least in my own experiments to try this out).
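The snapshot/rollback workflow mentioned above looks roughly like this (a sketch with hypothetical mount points and subvolume names; adjust paths for your system):

```shell
# Create a subvolume and take a read-only snapshot of it.
btrfs subvolume create /mnt/data/projects
btrfs subvolume snapshot -r /mnt/data/projects /mnt/data/projects-snap

# ...make (and regret) some changes...

# Roll back: drop the live subvolume and restore a writable
# copy from the snapshot. This is O(metadata), not a full copy.
btrfs subvolume delete /mnt/data/projects
btrfs subvolume snapshot /mnt/data/projects-snap /mnt/data/projects
```

Because snapshots are copy-on-write, they are near-instant and initially consume almost no extra space.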

[0] https://lwn.net/Articles/824855/ [1] https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...



Facebook has stacks of thousands of spare nodes ready at any moment to replace a failed node. All essential data will be replicated across many different boxes so if a box fails you just replace it with a fresh node and replicate the data there.

This is much different from the consumer use case, where computers are pets and not cattle. A failed filesystem the night before you need to turn in your thesis may have a much larger impact on your life.

Another thing to consider is that Facebook runs btrfs on enterprise hardware (including SSDs with battery backups), which is going to be much more reliable than some Chromebook that lives in the bottom of your backpack and rides transit every day.

Finally, I will say that the copy-on-write features of btrfs can result in some wildly different behaviour depending on how you use it. You can get into some very bad pathological cases with write amplification, and if you run btrfs on top of LUKS it can be nearly impossible to figure out why your disk is being pegged despite very little throughput at the VFS layer.
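One way to see the mismatch described above is to compare per-process I/O (what applications ask for) with device-level I/O (what the disk actually does). A sketch, assuming the sysstat package is installed and using placeholder paths:

```shell
# Device-level view: watch write throughput and %util on the raw disk.
iostat -x 5

# VFS-level view: per-process read/write rates. If the device numbers
# dwarf the per-process totals, CoW write amplification is a suspect.
pidstat -d 5

# For write-heavy files like VM images or database files, disabling CoW
# per-directory can help (only affects files created afterwards):
chattr +C /var/lib/libvirt/images
```

Comparing the two views over the same interval is usually enough to tell whether the filesystem, rather than the workload, is generating the traffic.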


The ChromeOS Linux dev VM uses btrfs by the way.


So much FUD in this discussion. Chris Mason has said publicly that they use the cheapest SSDs they can find (worse than what he'd be willing to put in his laptop), and that they investigate every instance of btrfs corruption. You're saying the exact opposite of the main btrfs guy at Facebook. I wonder who is right...


Who is right: one guy whose reputation relies on the thing not breaking, or a bunch of end users who report that it broke for them?

I experienced issues with write amplification within the past few months on Ubuntu 22, so it isn't as if all the issues are gone. I do agree that there are fewer issues now than before, but I will still say that btrfs breaks or behaves unexpectedly much more often than ext4 or XFS.


Meta does a lot of things that don't scale down to reliable/trustworthy systems and aren't suitable for all use cases. (I used to work there too.)

ZFS is only reliable where it was battle-tested: on Solaris. ZoL has been tinkered with and reworked so much that running it is nothing like running a Thumper as a NAS.

XFS + mdadm on Linux is, without a doubt, far more reliable than ZoL. Ask me how I know. I have the scars to prove it.


ZFS on Linux is absolutely fine in high-performance and critical computing applications.

I also owned a Thumper and Thor running Solaris in 2009. Much prefer Linux and the hardware solutions today.


ZFS has been plenty battle tested on FreeBSD.


Not hardly, and not in the way you think. They dropped their arguably purer ZFS port and replaced it with ZoL. As such, it's nowhere near as tested and proven as existing solutions like ext4 and XFS, which Red Hat has deployed to millions of machines for decades. ZFS has too many religious fanboys who hype it without considering that boring and reliable are less risky than betting on code that hasn't had nearly the same scale of enterprise experience.


I am well aware of that change.

What specific problems are there with the ZFS implementation on FreeBSD? You claim it is not battle-tested; I find it to be rock solid.


> You claim it is not battle tested, I find it to be rock solid

Those two things are not mutually exclusive.


Never said they were


Did you or other XFS users try out stratis?


tell me your story


Yeah, my setup too: XFS + mdadm (+ eventually LVM2). Rock solid. It might not have hardware-RAID performance, but in terms of stability, flexibility and recovery it's absolutely unbeatable!
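For anyone curious what this stack looks like, a minimal sketch (placeholder device names; these commands destroy data on the listed disks):

```shell
# mdadm provides the RAID layer (RAID10 across four disks here).
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sda /dev/sdb /dev/sdc /dev/sdd

# LVM2 on top gives flexible volume management.
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 500G -n data vg0

# XFS as the filesystem.
mkfs.xfs /dev/vg0/data
mount /dev/vg0/data /mnt/data
```

Each layer is independently mature and recoverable with its own tooling (mdadm --assemble, vgscan, xfs_repair), which is a big part of the stability argument.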


I am stuck in the btrfs-is-buggy mindset precisely because it managed to lose my root partition on a single disk machine. It might also have raid problems, but not exclusively.


Me too. Repeatedly, at least once a year, on 3 different machines.

The cause? Filling up the filesystem. Why? Because of OS snapshots.

(Aside: why can they fill it? Because it doesn't give a straight answer to `df -h`. Why not? Because of snapshots.)
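For context, btrfs ships its own space-reporting tools precisely because plain `df` can't account for CoW allocation and snapshots (a sketch with a placeholder mount point):

```shell
# Plain df: the numbers here can diverge from reality on btrfs.
df -h /mnt/data

# btrfs's own accounting splits data, metadata, and system allocations:
btrfs filesystem usage /mnt/data
btrfs filesystem df /mnt/data

# List snapshots that may be pinning "freed" space:
btrfs subvolume list -s /mnt/data
```

That said, needing filesystem-specific tools to answer "how full is my disk" is exactly the complaint being made here.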


That happened recently? A few years ago they added a reserved area used for emergency purposes that should solve situations like that. Can't say I've run into these problems, although I don't tend to run btrfs very heavily because performance becomes unacceptable long before that due to CoW.

https://btrfs.readthedocs.io/en/latest/btrfs-filesystem.html

(Look for "GlobalReserve")


> That happened recently?

It happened to me repeatedly on both openSUSE Leap and openSUSE Tumbleweed during the 4 years I worked for SUSE: 2017-2021.

The `df` command doesn't work: it does not give reliable info. That alone disqualifies this FS for me.

The `fsck` equivalent does not work: every time I have tried, it corrupts volumes into unreadability.

Those 2 things are hard requirements for me.

I raised this internally as significant issues. They were dismissed.


> I don't know about its multi-disk story (I do use ZFS for that personally), but for single disk options it is great.

I can reliably, across vendors and drives, break RAID10 on BTRFS where MD+LVM are totally fine. Simply pull power. Discovered this when building out my latest workstation.

I haven't tried other configurations; after finding this pattern I decided to leave BTRFS for single-disk configurations where I want CoW.
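For reference, the failing configuration described above would look roughly like this (placeholder devices and mount point; btrfs does RAID10 natively, with no md layer underneath):

```shell
# Native btrfs RAID10 across four disks (data and metadata both mirrored+striped).
mkfs.btrfs -d raid10 -m raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# After an unclean shutdown, verify redundancy and look for errors:
btrfs scrub start -B /mnt/pool
btrfs device stats /mnt/pool
```

A scrub after a power-pull is the usual way to confirm whether the array survived intact, which is presumably where the breakage showed up.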


I use btrfs a lot but I'm not sure if I'd use it for production servers. The I/O bandwidth is just a lot lower and I get weird latency problems on desktop Linux when BTRFS is very busy that I don't get on other file systems. Then again, I probably wouldn't use ZFS for anything but a NAS setup either.


The default is to coalesce trim requests into large batches and issue them once per minute or so; most other filesystems don't use online trim at all. This can cause latency spikes. If you ever decide to try it again, try disabling online trim.

https://btrfs.readthedocs.io/en/latest/Trim.html
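Concretely, the relevant mount options (names per btrfs(5); verify behaviour on your kernel version, device paths are placeholders):

```shell
# Disable online trim entirely and rely on periodic fstrim instead:
mount -o nodiscard /dev/sdb1 /mnt/data

# Or keep trim but batch it asynchronously (the default on recent kernels):
mount -o discard=async /dev/sdb1 /mnt/data

# The periodic alternative most distros already ship:
systemctl status fstrim.timer
```

With `nodiscard` plus the weekly `fstrim.timer`, trims happen off the hot path instead of competing with foreground I/O.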


> Facebook runs their entire stack using Btrfs

Yeah, and when I was there machines would run out of disk space at 50% usage and it took months to figure out why. In the meantime, they'd just reimage the machine and hope. I don't recall any issues with data loss, but it didn't have the air of reliability.

But my team was weird at FB, our uptimes of 45 days were way above the average, and we ran into all sorts of things because we operated outside the norm.


Is the structure of 800 GB btrfs containers mentioned in [0] how user data is stored? Just sharded across billions of containers?


Last time they talked about it (that I know of -- when Fedora was contemplating using btrfs and asked Chris Mason et al for their opinion), FB were running databases on xfs and were looking for ways to place them on raw disks for maximum performance. So not the entire stack.


Why is this grey/down? Is there something factually incorrect?

Edit: it's less grey now.



