I'm working on bcachefs, a next generation Linux filesystem descended from the bcache codebase with a long list of features:
- Copy on write (COW) - like zfs or btrfs
- Good performance - significantly better than existing copy on write filesystems, comparable to ext4/xfs
- Metadata and data checksumming
- Multiple devices, including replication and other types of RAID
- Scalable - has been tested to 50+ TB, will eventually scale far higher
- Already working and stable, with a small community of users
The Linux filesystem situation is in a bit of a bad place these days. Currently, we have:
- ext4, which works - mostly - but is showing its age. The codebase terrifies most filesystem developers who have had to work on it, and heavy users still run into terrifying performance and data corruption bugs with frightening regularity. The general opinion of filesystem developers is that it's a miracle it works as well as it does, and ext4's best feature is its fsck (which does indeed work miracles).
- xfs, which is reliable and robust but still fundamentally a classical design - it's designed around update in place, not copy on write (COW). As someone who's both read and written quite a bit of filesystem code, the xfs developers (and Dave Chinner in particular) routinely impress me with just how rigorous their code is - the quality of the xfs code is genuinely head and shoulders above any other upstream filesystem. Unfortunately, there is a long list of very desirable features that are not really possible in a non COW filesystem, and it is generally recognized that xfs will not be the vehicle for those features.
- btrfs, which was supposed to be Linux's next generation COW filesystem - Linux's answer to zfs. Unfortunately, too much code was written too quickly without focusing on getting the core design correct first, and now it has too many design mistakes baked into the on disk format and an enormous, messy codebase - bigger than xfs. It's taken far too long to stabilize as well - poisoning the well for future filesystems, because too many people were burned on btrfs, repeatedly (e.g. Fedora's tried to switch to btrfs multiple times and had to back out at the last minute, and server vendors who years ago hoped to one day roll out btrfs are now quietly migrating to xfs instead).
- zfs, to which we all owe a debt for showing us what could be done in a COW filesystem, but which is never going to be a first class citizen on Linux. Also, they made certain design compromises that I can't fault them for - but it's possible to do better. (Primarily, zfs is block based, not extent based, whereas all other modern filesystems have been extent based for years; the reason they did this is that extents plus snapshots are really hard.)
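To make the block based vs. extent based distinction concrete, here's a toy sketch (not zfs or bcachefs code - the structures and numbers are invented for illustration) of how much mapping metadata each approach needs for one contiguous 1 MiB file:

```python
# Illustrative only: mapping a contiguous 1 MiB file with 4 KiB blocks.

BLOCK_SIZE = 4096
FILE_SIZE = 1024 * 1024

# Block based (zfs-style): one mapping entry per block, even when the
# file is perfectly contiguous on disk.
block_map = {lbn: 1000 + lbn for lbn in range(FILE_SIZE // BLOCK_SIZE)}

# Extent based: a single entry covering the whole contiguous range,
# as (logical_start, length_in_blocks, physical_start).
extent_map = [(0, FILE_SIZE // BLOCK_SIZE, 1000)]

print(len(block_map))   # 256 entries
print(len(extent_map))  # 1 entry
```

The difficulty zfs sidestepped: when a snapshot shares an extent and the new version overwrites only part of it, the extent has to be split and reference counted, which is where most of the complexity lives.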
So, many people would agree that there's room for something new - but why bcachefs?
- It's stable:
One of the biggest difficulties with creating any new filesystem today is that a POSIX filesystem is a huge amount of functionality - and until you have nearly all of it implemented, you don't have anything for people to test. Filesystems don't lend themselves to incremental, bottom up development - instead, there's a huge amount of pressure to do everything at once, which many decades of experience has taught us is not the ideal way to develop software. This is primarily what went wrong with btrfs.
bcachefs's huge advantage is that bcache already was the bottom half of a filesystem - and it was fast, stable, and had a userbase large enough to prove that, and a test suite that could exercise any changes to the existing functionality made for filesystem support.
As a result, so far only a few mostly trivial bugs have been found since I and others started using it in anger. I, the author, have been using it for my root filesystem on the laptop I use for bcachefs development for several months now and it's been surprisingly uneventful. It's been solid.
- It's fast: see these benchmarks. They're a bit old, and there's been significant improvement on several of them since.
Also, what those benchmarks don't show is that bcachefs is very much designed with tail latency in mind. Tail latency has been the bane of ext4 users for many years - dependencies in the journalling code and elsewhere can lead to 30+ second latencies on simple operations (e.g. unlinks) on multithreaded workloads. No one seems to know how to fix them.
In bcachefs, the only reason a thread blocks on IO is because it explicitly asked to (an uncached read or an fsync operation), or resource exhaustion - full stop. Locks that would block foreground operations are never held while doing IO. While bcachefs isn't a realtime filesystem today (it lacks e.g. realtime scheduling for IO), it very conceivably could be one day.
- It has a small, clean codebase:
bcachefs has a codebase smaller than ext4 while delivering most of the features of btrfs. Refactorings to make the codebase cleaner and easier to understand are done aggressively. I want bcachefs to outlast me - by making it the easiest filesystem for other developers to understand and work on.
- It has, either finished or in progress, features we all need:
The zfs folks have made the argument for data checksumming better than I can - that alone would be a sufficient reason for a good COW filesystem.
Beyond that though, if you can think of a reasonable feature for a filesystem to have - e.g. encryption, compression, snapshots, send/receive - most likely bcachefs either has plans and provisions for it or it's already been added. The original bcache design has turned out to be wonderfully flexible, and we haven't yet run into a feature where we've had to say "no, that'd be too ugly to add".
Status: Right now, all the core POSIX filesystem functionality is done and working, and there aren't any known outstanding bugs in that core area (it's been passing xfstests for many months, and I and others have been successfully using it): you should be able to use it for your root filesystem, on a single device, without issues.
- Checksumming, data and metadata: done (by default, crc32c is enabled for both data and metadata - keep this in mind if you're running benchmarks).
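The point of checksumming on a COW filesystem is that every read verifies the data against the checksum stored in the metadata, so silent corruption is detected rather than returned to the application. Here's a conceptual sketch of that read path - this is not the bcachefs implementation, and zlib.crc32 stands in for crc32c, which isn't in the Python standard library:

```python
import zlib

def write_block(data: bytes) -> dict:
    # Store the checksum alongside the pointer to the data, the way a
    # checksumming filesystem keeps it in its metadata.
    return {"data": data, "csum": zlib.crc32(data)}

def read_block(block: dict) -> bytes:
    # Verify on every read: a mismatch means the device returned data
    # that differs from what was written.
    if zlib.crc32(block["data"]) != block["csum"]:
        raise IOError("checksum mismatch: data corrupted")
    return block["data"]

blk = write_block(b"hello")
assert read_block(blk) == b"hello"

blk["data"] = b"hellO"           # simulate bit rot on the media
try:
    read_block(blk)
except IOError as e:
    print(e)                     # checksum mismatch: data corrupted
```

(And this is why checksumming matters for benchmarks: every data read pays for a checksum computation unless you disable it.)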
- Compression (lz4, gzip): working and stable - unfortunately, disk space accounting for compressed data isn't finished yet, so enabling compression doesn't yet allow you to fit more data in a filesystem. There are also some performance improvements I still want to make.
- Multiple devices, including tiering (caching): working, but not currently well tested so you may run into bugs - I wouldn't recommend using it with data you care about.
- Replication - 80% completed, but failure paths (rereading from a different replica, rereplicating data on drive failure) aren't completed so it's not terribly useful yet.
- Erasure coding (Reed-Solomon, i.e. RAID5/6): Not yet started
- Encryption: Not yet started
- Send/receive: On disk format work is done - we have a version number field for everything we index, so implementing send/receive should be straightforward, but it hasn't been started yet.
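The idea behind version-number-based send/receive can be sketched in a few lines - the field names here are made up, but the mechanism matches the text: every indexed key carries a version, and an incremental send stream is simply the keys whose version is newer than the last send:

```python
# Hypothetical illustration of version-based incremental send; not bcachefs code.

index = [
    {"key": "inode:1",    "version": 3},
    {"key": "inode:2",    "version": 7},
    {"key": "extent:1:0", "version": 9},
]

def incremental_send(index, last_sent_version):
    # Anything modified since the previous send was stamped with a
    # higher version, so one scan of the index finds the delta.
    return [e for e in index if e["version"] > last_sent_version]

stream = incremental_send(index, last_sent_version=5)
# stream contains the two entries with versions 7 and 9
```

The appeal of this scheme is that computing the delta doesn't require diffing two snapshots; the index itself records what changed.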
- Snapshots: In progress. bcachefs's snapshot implementation is going to be significantly more capable than any competing implementations. When it's done, you should be able to take snapshots via a cron job every 5 minutes if you so desire - without it adversely affecting performance, and with good space efficiency.
- SMR (shingled) drive support - SMR drives achieve higher storage density at the cost of disallowing random writes - which means the drives themselves have to implement copying garbage collection to be used with existing filesystems. Doing the GC in the filesystem would give better performance, as the filesystem is aware of things like files and what data is live or not. bcachefs already has the copy GC algorithms implemented, so native SMR support is mostly a matter of adding a shim layer to teach bcachefs about the drive layout.
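The copying GC algorithm in question is conceptually simple - this toy sketch (zone layout and structures invented for illustration, not bcachefs code) shows why knowing which data is live matters: only survivors get rewritten, and emptied zones can then be reset, which is the only kind of "overwrite" SMR permits:

```python
# Toy copying garbage collection over append-only zones; illustrative only.

def copy_gc(zones, live):
    """Compact live data into a fresh zone so old zones can be reset.

    zones: list of zones, each a list of data items written append-only
    live:  set of items still referenced by the filesystem
    """
    # Scan old zones, keeping only items the filesystem still references.
    survivors = [item for zone in zones for item in zone if item in live]
    # Rewrite survivors sequentially into a new zone; the old zones are
    # now entirely garbage and can be reset for reuse.
    return [survivors], len(survivors)

zones = [["a", "b"], ["c", "d"]]
new_zones, copied = copy_gc(zones, live={"a", "d"})
print(new_zones, copied)  # [['a', 'd']] 2
```

A drive-internal FTL doing the same job has to treat every sector as potentially live; the filesystem knows "b" and "c" are dead and skips copying them, which is the performance argument made above.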
- Raw flash support: Like SMR, raw flash chips don't allow random writes, thus all SSDs have a flash translation layer (FTL) that implements a logical to physical mapping as well as allocation and garbage collection - functionality that bcachefs already has.
The appeal of this is that existing FTLs are large, complicated black boxes that are in the performance fast path (since they implement copying garbage collection, they affect latency in unpredictable ways) and they're impossible for customers to debug or even generally understand the behavior of. This has already been sufficient motivation for projects to work on open source, host side FTLs that target raw flash via a standard interface, so there's already some precedent for what we'd like to do in bcachefs.
- PMEM (persistent memory) support: bcachefs will definitely get basic PMEM support (DAX, in Linux kernel terminology), equivalent to what other filesystems have done. We'll also be adding full data journalling for super fast syncs when only the journal is on PMEM.
Further off, we intend to add support for accessing the btree on PMEM directly, instead of reading it into DRAM first - this will significantly benefit very large filesystems (perhaps with a fast PMEM tier and a slower, but much larger tier of flash or SMR devices) where filesystem metadata no longer fits in DRAM.