Tiering is dead; long live disk groups
 
The new disk-group-based code for configuring data placement has been merged, and the notion of configuring disks into "tiers" has been removed. If you have an existing filesystem that uses tiering, you'll have to reconfigure it via the new interfaces.

The reasoning behind the change was that a "disk tier" wasn't really a thing - it was just a hint to a couple of different parts of the IO subsystem about where they should put data and how they should move it around. Instead of having one hint that was effectively a global setting, we're now exposing settings that correspond to what we're actually doing. As a bonus, these new settings can be overridden for individual files and directories.

The things tiering actually did, and the new equivalents, are:
* It was a hint to the read path - it told the read path which disks were faster so it could prefer those. Now, we instead track the recent IO latency to each device, so we know which devices are faster.

* Foreground writes and metadata allocations would prefer to use devices in the fastest tier. This is now controlled with the foreground_target option.

* In the background, the tiering thread would look for dirty data on the faster tier and copy it to the slower tier (leaving the original data there, but marking it as cached). This is now controlled with the background_target option: if it is set, data will be copied in the background to the disks in background_target (as before, leaving the original data in place but marking it cached).

* When reading, if there was no copy of the data being read in the fastest tier, we'd write a cached copy of that data to the fastest tier. This is now controlled with promote_target.

So, the old recipe to format /dev/sda and /dev/sdb using /dev/sda as a writeback cache was:

bcachefs format --tier 0 /dev/sda --tier 1 /dev/sdb

The new equivalent is:

bcachefs format /dev/sd[ab] \
--foreground_target /dev/sda \
--background_target /dev/sdb \
--promote_target /dev/sda

The new options can also be changed at runtime, via sysfs in the /sys/fs/bcachefs/*/options dir. (They can't be specified at mount time like most other options, because at the point where mount options are parsed we don't yet have what we need to look up disk groups.)
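For example, switching the background target at runtime might look something like this (a sketch - the UUID is a placeholder for your filesystem's directory under sysfs, and it's assumed the option files accept the same target names used at format time):

echo /dev/sdb > /sys/fs/bcachefs/<UUID>/options/background_target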

Disk groups come in because we need a new way to group disks, and a way for the new settings to refer to more than one disk. A disk can be in at most one "disk group", and you refer to disk groups by label. You don't need to bother with disk groups if you don't actually need to refer to more than one disk - all the new settings that take disk groups can also be passed an individual disk.

The example recipe using disk groups would be:

bcachefs format \
--group ssd /dev/sda /dev/sdb \
--group hdd /dev/sdc /dev/sdd \
--foreground_target ssd \
--background_target hdd \
--promote_target ssd

Note that you don't have to use the options this way - e.g. you could specify /dev/sdb for the foreground target and /dev/sda for the promote target to use /dev/sda as a writearound cache.
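To illustrate (a sketch reusing the two-device layout from the first example), that writearound setup would look something like:

bcachefs format /dev/sd[ab] \
--foreground_target /dev/sdb \
--promote_target /dev/sda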

There are still more options we'll need to add - for example, we don't yet have a way to specify writethrough caching. For that we need a way to specify that foreground writes should write dirty data to one target, and also a cached copy to another target. But we now have a lot of the infrastructure in place to do lots of interesting things.

Other notes and gotchas:

The foreground_target option doesn't restrict foreground writes to only use that target; if the device(s) in that target are full, the allocation will fall back to other devices in the filesystem. For writethrough caching, we'll need some way to restrict a device to only contain cached data. We do have a --data_allowed option that can be used to restrict the types of data a device may contain (journal, btree or user data) - but it can't (yet) be used to restrict a device to only contain cached data.
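As a rough sketch of how --data_allowed can be used today (assuming the per-device option is given before the device it applies to, like the other per-device format options), restricting one device to metadata only would look something like:

bcachefs format \
--data_allowed journal,btree /dev/sda \
/dev/sdb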

We don't yet have a mechanism for caching metadata. For writeback caching this is no big deal - you'll probably want your metadata to always live on the fast device anyway - but for writethrough caching it's a bit of a problem (though probably not a huge deal in practice, since bcachefs metadata is fairly compact and should mostly be cached in RAM).

An option was added recently, background_compression, to specify that data should be compressed or recompressed with a given algorithm in the background. When the option was first added, nothing would actively look for data to recompress - it was only applied when tiering or copygc were moving data around. The new rebalance thread (which replaces the old tiering thread) does actively compress data that isn't already compressed with the algorithm specified by background_compression.
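For example (a sketch - assuming lz4 is the algorithm you want, and the same placeholder UUID as above), background compression can be set either at format time or at runtime via the options dir:

bcachefs format --background_compression lz4 /dev/sd[ab]
echo lz4 > /sys/fs/bcachefs/<UUID>/options/background_compression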

And, probably the biggest one to be aware of: the ratelimiting algorithm for the rebalance thread is definitely going to need more work. The one big advantage of the tiering approach was that we always had an accurate count of how much work there was for tiering to do - that's no longer really possible (in particular because all these options can be changed at runtime).

So, while you're testing the new functionality, please keep an eye on it and report any misbehavior - e.g. not keeping up, or spinning and using too much CPU. You can use bcachefs fs usage to see how much data is on each of the component devices.
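For example (assuming the filesystem is mounted at /mnt):

bcachefs fs usage /mnt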