retry forever if cct->_conf->bdev_flock_retry is 0.
systemd-udevd is most likely the reason why ceph-osd fails to
acquire the flock during "mkfs". systemd-udevd uses libblkid to
probe block devices whenever a device changes in the system, and
while it examines a device it holds a `LOCK_SH|LOCK_NB` lock on
it. It releases the lock as soon as it is done with the device,
so normally the lock is held only for a jiffy or so, see
https://github.com/systemd/systemd/blob/ee0b9e721a368742ac6fa9c3d9a33e45dc3203a2/src/shared/lockfile-util.c#L18
So we just need to retry a couple of times when acquiring the
lock.
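The contention described above can be sketched in C++ as follows. This is a minimal, hypothetical sketch, not the KernelDevice code: `lock_with_retry()` and `should_give_up()` are illustrative names, the use of `flock(LOCK_EX | LOCK_NB)` on the OSD side is an assumption, and the fixed `retry_interval` parameter stands in for whatever interval the real code computes. `should_give_up()` mirrors the patched condition `max_retry > 0 && nr_tries++ == max_retry`, so a `max_retry` of 0 never gives up.

```cpp
#include <sys/file.h>  // flock(), LOCK_EX, LOCK_SH, LOCK_NB, LOCK_UN
#include <fcntl.h>     // open(), O_RDWR (used in the usage example)
#include <unistd.h>    // close(), unlink() (used in the usage example)
#include <cassert>     // assert() (used in the usage example)
#include <cerrno>      // errno, EWOULDBLOCK, EINTR, EAGAIN
#include <chrono>
#include <cstdint>
#include <cstdlib>     // mkstemp() (used in the usage example)
#include <thread>      // std::this_thread::sleep_for

// Hypothetical helper mirroring the patched check
//   max_retry > 0 && nr_tries++ == max_retry
// Returns true once the retry budget is used up. Because of the
// short-circuit, max_retry == 0 never gives up ("unlimited") and
// never increments nr_tries.
inline bool should_give_up(uint64_t max_retry, uint64_t& nr_tries) {
  return max_retry > 0 && nr_tries++ == max_retry;
}

// Hypothetical sketch: take an exclusive, non-blocking flock on fd,
// retrying while another holder (e.g. systemd-udevd with its brief
// LOCK_SH|LOCK_NB lock) keeps the device busy.
// Returns 0 on success, -EAGAIN once the initial attempt and
// max_retry retries have all failed, or -errno for any other error.
// The fixed retry_interval is an assumption of this sketch.
inline int lock_with_retry(int fd, uint64_t max_retry,
                           std::chrono::milliseconds retry_interval) {
  uint64_t nr_tries = 0;
  for (;;) {
    if (::flock(fd, LOCK_EX | LOCK_NB) == 0) {
      return 0;  // got the exclusive lock
    }
    const int err = errno;
    if (err == EINTR) {
      continue;  // interrupted by a signal: try again without counting
    }
    if (err != EWOULDBLOCK) {
      return -err;  // a real error, not contention
    }
    if (should_give_up(max_retry, nr_tries)) {
      return -EAGAIN;  // retry budget exhausted
    }
    std::this_thread::sleep_for(retry_interval);  // lock is busy: back off
  }
}
```

With `max_retry` = 3 the sketch makes one initial attempt plus three retries before returning `-EAGAIN`; with `max_retry` = 0 it keeps retrying until the other holder releases its lock, which matches the "0 means 'unlimited'" description of `bdev_flock_retry`.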
Fixes: https://tracker.ceph.com/issues/46124
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 743b5bda6559c9be0e64617aa43ef5e06a5a6e60)
Conflicts:
src/blk/kernel/KernelDevice.cc
- file does not exist in nautilus; the changes were applied
manually to src/os/bluestore/KernelDevice.cc
Option("bdev_flock_retry", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(3)
- .set_description("times to retry the flock"),
+ .set_description("times to retry the flock")
+ .set_long_description(
+ "The number of times to retry on getting the block device lock. "
+ "Programs such as systemd-udevd may compete with Ceph for this lock. "
+ "0 means 'unlimited'."),
Option("bluefs_alloc_size", Option::TYPE_SIZE, Option::LEVEL_ADVANCED)
.set_default(1_M)
dout(1) << __func__ << " flock busy on " << path << dendl;
if (const uint64_t max_retry =
cct->_conf.get_val<uint64_t>("bdev_flock_retry");
- nr_tries++ == max_retry) {
+ max_retry > 0 && nr_tries++ == max_retry) {
return -EAGAIN;
}
double retry_interval =