- Interior btree node updates are now journalled, removing the need for btree writes to be FUA
- Interior btree node updates are now fully transactional, so we no longer have to do any metadata scanning after unclean shutdown
- Btree key cache code has been merged
- Major rework of journal replay finally finished
- Lots of bug fixing
So, some background:
Historically, the btree and the journal in bcache/bcachefs have been fairly separate entities: the btree has always been internally consistent on disk without anything from the journal. The journal just contained updates to leaf nodes, and journal replay just meant redoing all those updates in the same order as they occur in the journal.
The downside of this was that any time we updated an interior btree node (because we split or compacted a leaf node), we had to write out the update to the interior node right away - which meant we had to use FUA (force unit access: the write has to reach stable storage before it completes, instead of sitting in the drive's volatile write cache) for all btree node writes.
That's a disadvantage because consumer drives tend either to not support FUA (meaning it has to be emulated by the block layer with cache flushes), or to internally flush the whole cache when they receive a FUA write - or worse, to have buggy FUA support. It turns out other filesystems have also been bitten by drives with buggy FUA support, and some of the bug reports I'd been seeing seemed to indicate that this was happening to us too. So several months ago I finally got around to a long-contemplated project: journalling updates to interior btree nodes, not just leaf nodes.
The changes to the interior btree update code went pretty smoothly, as did tweaking journal replay to replay updates to interior nodes first - but at the time I missed the full implications of having to start the allocator threads before journal replay had made the btree consistent again. Oops.
So that took a while to sort out - hence the long delay in updates; recovery from unclean shutdown was somewhat broken for quite a while. But, at long last, it's finished: the last major piece required was merging in the btree key cache code, which I'd been working on for some time but hadn't quite finished.
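To make "replay updates to interior nodes first" concrete, here's a rough sketch of the ordering in Python. This is a toy model, not the actual bcachefs code: `JournalKey`, its `level` field (0 for leaf nodes, higher for interior nodes) and `replay_order` are names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalKey:
    """Hypothetical journalled update; not the real on-disk structure."""
    seq: int    # position in the journal
    level: int  # assumed convention: 0 = leaf node, > 0 = interior node
    key: str


def replay_order(keys):
    # Interior node updates (level > 0) sort first; within each group
    # the original journal order (seq) is preserved.
    return sorted(keys, key=lambda k: (k.level == 0, k.seq))


journal = [
    JournalKey(seq=1, level=0, key="extent:a"),
    JournalKey(seq=2, level=1, key="node:split"),
    JournalKey(seq=3, level=0, key="extent:b"),
    JournalKey(seq=4, level=2, key="node:root"),
]

for k in replay_order(journal):
    print(k.seq, k.level, k.key)
```

With this input the interior updates (seq 2 and 4) come out first, followed by the leaf updates (seq 1 and 3) in journal order.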
The btree key cache code acts as a write cache for the btree, for keys that are going to be updated frequently in a short span of time (e.g. inodes and keys in the alloc btree). Normally when we do a btree update, we update the journal and the btree at the same time - but there's no real requirement in e.g. the on-disk format that we update the btree at the same time; the btree just has to be updated before releasing the pin on the relevant journal entry. Deferring the btree update lets us skip the relatively expensive btree traversal, and helps with lock contention since a single btree leaf node can hold many keys.
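Here's a minimal sketch of that idea, again as a toy Python model and not the real implementation. `ToyJournal`, `ToyKeyCache` and the choice to pin the journal entry of the oldest unflushed update per key are assumptions made for illustration; the "btree" is just a dict.

```python
class ToyJournal:
    """Hypothetical journal: an append-only list plus pin counts."""

    def __init__(self):
        self.entries = []  # list of (seq, key, value)
        self.pins = {}     # seq -> number of pins held

    def append(self, key, value):
        seq = len(self.entries)
        self.entries.append((seq, key, value))
        return seq

    def pin(self, seq):
        self.pins[seq] = self.pins.get(seq, 0) + 1

    def unpin(self, seq):
        self.pins[seq] -= 1
        if self.pins[seq] == 0:
            del self.pins[seq]

    def oldest_pinned(self):
        # Entries at or after this seq cannot be reclaimed yet.
        return min(self.pins) if self.pins else None


class ToyKeyCache:
    """Hypothetical write cache sitting in front of a dict 'btree'."""

    def __init__(self, journal, btree):
        self.journal = journal
        self.btree = btree
        self.dirty = {}        # key -> (latest value, pinned seq)
        self.btree_writes = 0

    def update(self, key, value):
        # Every update is journalled, but the btree is not touched.
        seq = self.journal.append(key, value)
        if key in self.dirty:
            _, pinned = self.dirty[key]
            self.dirty[key] = (value, pinned)  # keep the oldest pin
        else:
            self.journal.pin(seq)
            self.dirty[key] = (value, seq)

    def flush(self):
        # Write each dirty key to the btree once, then release its pin.
        for key, (value, pinned) in self.dirty.items():
            self.btree[key] = value
            self.btree_writes += 1
            self.journal.unpin(pinned)
        self.dirty.clear()


journal, btree = ToyJournal(), {}
cache = ToyKeyCache(journal, btree)
for i in range(5):
    cache.update("inode:42", i)
cache.flush()
print(btree, cache.btree_writes)
```

Five updates to the same key turn into five journal entries but only one btree write, and the pin is only dropped once that write has happened.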
This helps us with journal replay because it means the allocator threads can do their thing without actually updating the alloc btree - they just update keys in the btree write cache, which will be flushed back to the btree by journal reclaim; all we have to do now is not start journal reclaim until we've finished replaying all the updates to interior btree nodes.
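The ordering constraint can be sketched like this. It's a toy model under the same caveats as before: `ToyRecovery` and its method names are hypothetical, and real journal reclaim does a lot more than flush a dict.

```python
class ToyRecovery:
    """Hypothetical model of recovery ordering; not the real code."""

    def __init__(self):
        self.interior_replayed = False
        self.key_cache = {}    # dirty alloc keys, not yet in the btree
        self.alloc_btree = {}  # stands in for the alloc btree

    def allocator_update(self, bucket, state):
        # The allocator may run before the btree is consistent,
        # because it only ever touches the key cache.
        self.key_cache[bucket] = state

    def finish_interior_replay(self):
        self.interior_replayed = True

    def journal_reclaim(self):
        # Reclaim is the only path that writes cached keys back to the
        # btree, so it is held off until interior replay is done.
        if not self.interior_replayed:
            return 0
        flushed = len(self.key_cache)
        self.alloc_btree.update(self.key_cache)
        self.key_cache.clear()
        return flushed


r = ToyRecovery()
r.allocator_update("bucket:7", "allocated")
print(r.journal_reclaim())  # too early: nothing is flushed
r.finish_interior_replay()
print(r.journal_reclaim())  # now the cached key reaches the btree
```

The point of the sketch is just that the allocator never needs the btree to be consistent; only the flush does, and the flush is what we delay.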
So, with all that done, there should be some performance improvements, both from no longer doing FUA btree node writes and from having the btree key cache enabled for the alloc btree. It's not enabled for the inodes btree yet - that patch still needs a bit more work.
Next up: I think I'm going to see what I can get done with erasure coding.
And keep the bug reports coming!