This mirrors what we already do during allocation. We lock the control
device first, and then release the block device and then delete it.
This makes things substantially more robust as long all participants do
such locking: we won't attempt to delete a block device somebody else
already is using.
Back when I wrote this code I wasn't aware of BLKPG and what it can do.
Hence I came up with this hack to attach an empty file to delete all
partitions. But today we can do better with BLKPG: let's just explicitly
remove all partitions, and then try again.
Let's rework how we lock loopback block devices in two ways:
1. Lock a separate fd, instead of the main block device fd. We already
did that for our internal locking when allocating loopback block
devices, but do so for the exposed locking (i.e.
loop_device_flock()), too, so that the lock is independent of the
main fd we actually use of IO.
2. Instead of locking the device during allocation of the loopback
device, then unlocking it (which will make udev run), and then
re-locking things if we need, let's instead just keep the lock the
whole time, to make things a bit safer and faster, and not have to
wait for udev at all. This is done by adding a "lock_op" parameter to
loop device allocation functions that declares the initial state of
the lock, and is one of LOCK_UN/LOCK_SH/LOCK_EX. This change also
shortens a lot of code, since we allocate + immediately lock loopback
devices pretty much everywhere.
Whenever we release a loopback device, let's first synchronously delete
all partitions, so that we know that's complete and not done
asynchronously in the background. Take a BSD lock on the device while
doing so, so that udev won't make the devices busy while we do this.
No actual code changes, just splitting out of some dev_t handling
related calls from stat-util.[ch], they are quite a number already, and
deserve their own module now I think.
Also, try to settle on the name "devnum" as the name for the concept,
instead of "devno" or "dev" or "devid". "devnum" is the name exported in
udev APIs, hence probably best to stick to that. (this just renames a
few symbols to "devum", local variables are left untouched, to make the
patch not too invasive)
No actual code changes.
attach_empty() file takes a BSD file lock on the device, and we really
should release that before going to sleep. hence explicitly close the
block device before the sleep instead of relying on _cleanup_ to close
it after the sleep.
This doesn't really make any real difference, given the file should only
contain a single line. But it's conceptually more correct to just remove
the trailing newline/whitespace then the whole lines coming after that.
i.e. if the file actually contains more lines than one, this should
probably be considered an error.
The LoopDevice object supports a shortcut: if the backing fd we are
supposed to create a loopback device of refers to a
block device alrady then we'll use it as is – if we can – instead of
setting up an unnecessary loopback device that would be pretty much
the same as its backing device.
Previously, when doing this we'd just dup() the original backing fd and
use that. But that's problematic in case O_DIRECT was set on the fd,
since we'll keep that flag set on our copy too, which means we can't do
simple, regular IO on it anymore.
Thus, let's reopen the inode in this case with the exact access flags
we'd apply if we'd actually allocate and open a new loopback device.
Fixes: #21176
The whole reason loop_device_make_internal() exists (as opposed to just
loop_device_make()) is to avoid mangling the loop flags value/call
getenv twice. Hence let's actually call it when we already mangled the
flags value.
When loop devices are re-used, the disk sequence number is increased.
Parse it when creating a loop device and store it.
The kernel will never return DISKSEQ=0, so use it to signal that it's
not supported by the current kernel.
This is similar to the preceding work to store the uevent seqnum, but
this stores the CLOCK_MONOTONIC timestamp.
Why? This allows to validate udev database entries, to determine if they
were created *after* we attached the device.
The uevent seqnum logic allows us to validate uevent, and the timestamp
database entries, hence together we should be able to validate both
sources of truth for us.
(note that this is all racy, just a bit less racy, since we cannot
atomically attach loopback devices and get the timestamp for it, the
same way we can't get the uevent seqnum. Thus is shortens the race
window, but doesn#t close it).
Later, this will allow us to ignore uevents from earlier attachments a
bit better, as we can compare uevent seqnums with this boundary. It's
not a full fix for the race though, since we cannot atomically determine
the uevent and attach the device, but it at least shortens the window a
bit.
Previously, loop_device_make() would return the device fd in one success
code path, but not the other (where' we'd just return 0).
loop_device_open() returns it in all cases.
Hence, let's clean this up, and make sure in all success code paths of
both functions we return it (even though it strictly speaking is
redundant, since we return it in LoopDevice anyway, and currently noone
actually relies on this).
Let's try to make collisions when multiple clients want to use the same
device less likely, by sleeping a random time on collision.
The loop device allocation protocol is inherently collision prone:
first, a program asks which is the next free loop device, then it tries
to acquire it, in a separate, unsynchronized setp. If many peers do this
all at the same time, they'll likely all collide when trying to
acquire the device, so that they need to ask for a free device again and
again.
Let's make this a little less prone to collisions, reducing the number
of failing attempts: whenever we notice a collision we'll now wait
short and randomized time, making it more likely another peer succeeds.
(This also adds a similar logic when retrying LOOP_SET_STATUS64, but
with a slightly altered calculation, since there we definitely want to
wait a bit, under all cases)
On current kernels (5.8 for example) under some conditions I don't fully
grok it might happen that a detached loopback block device still has
partition block devices around. Accessing these partition block devices
results in EIO errors (that also fill up dmesg). These devices cannot be
claned up with LOOP_CLR_FD (since the main device already is officially
detached), nor with LOOP_CTL_DELETE (returns EBUSY as long as the
partitions still exist). This is a kernel bug. But it appears to apply
to all recent kernels. I cannot really pin down what triggers this,
suffice to say our heavy-duty test can trigger it.
Either way, let's do something about it: when we notice this state we'll
attach an empty file to it, which is guaranteed to have to part table.
This makes the partitions go away. After closing/reoping the device we
hence are good to go again. ugly workaround, but I think OK enough to
use.
The net result is: with this commit, we'll guarantee that by the time we
attach a file to the loopback device we have zero kernel partitions
associated with it. Thus if we then wait for the kernel partitions we
need to appear we should have entirely reliable behaviour even if
loopback devices by the name are heavily recycled and udev events reach
us very late.
Fixes: #16858
When we fall back to classic LOOP_SET_FD logic in case LOOP_CONFIGURE
didn't work we issue LOOP_CLR_FD first. But that call turns out to be
potentially async in the kernel: if something else (let's say
udev/blkid) is accessing the device the ioctl just sets the autoclear
flag and exits. Hence quite often the LOOP_SET_FD will subsequently
fail. Let's avoid the trouble, and immediately exit with EBUSY if
LOOP_CONFIGURE fails, and but remember that LOOP_CONFIGURE is not
available so that on the next iteration we go directly for LOOP_SET_FD
instead.
It appears LOOP_CONFIGURE in 5.8 is even more broken than initially
thought: it doesn't properly propgate lo_sizelimit to the block device
layer. :-(
Let's hence check the block device size immediately after issuing
LOOP_CONFIGURE, and if it doesn't match what we just set let's fallback
to the old ioctls.
This means LOOP_CONFIGURE currently works correctly only for the most
simply case: no partition table logic and no size limit. Sad!
(Kernel people should really be told about the concepts of tests and
even CI, one day!)
LOOP_CONFIGURE allows us to configure a loopback device in one ioctl
instead of two, which is not just faster but also removes the race that
udev might start probing the device before we adjusted things properly.
Unfortunately LOOP_CONFIGURE is broken in regards to LO_FLAGS_PARTSCAN
as of kernel 5.8.0. This patch contains a work-around for that, to
fallback to old behaviour if partition scanning is requested but does
not work. Sucks a bit.
Proposed upstream fix for that issue:
https://lkml.org/lkml/2020/8/6/97
Apparently, if IO is still in flight at the moment we invoke LOOP_CLR_FD
it is likely simply dropped (probably because yanking physical storage,
such as a USB stick would drop it too). Let's protect ourselves against
that and always sync explicitly before we invoke it.