Loic Dachary [Fri, 18 Dec 2015 23:53:03 +0000 (00:53 +0100)]
ceph-disk: protect deactivate with activate lock
When ceph-disk prepares the disk, it triggers udev events and each of
them runs ceph-disk activate. If systemctl stop ceph-osd@2 happens while
ceph-disk activate calls are still in flight, the systemctl stop may be
cancelled by the systemctl enable issued by one of the pending ceph-disk
activate calls.
This only matters in a test environment where disks are destroyed
shortly after they are activated.
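
The locking idea can be sketched as follows. This is a minimal
illustration and not the ceph-disk code: the file_lock helper, LOCK_PATH
and the activate/deactivate bodies are hypothetical stand-ins, and the
systemctl calls are replaced by log entries.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file; the real path used by ceph-disk may differ.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-activate.lock")


@contextmanager
def file_lock(path):
    """Hold an exclusive advisory lock on path for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks while another process holds it
        yield
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)


def activate(osd_id, log):
    # Stand-in for "systemctl enable" followed by "systemctl start".
    with file_lock(LOCK_PATH):
        log.append(("enable+start", osd_id))


def deactivate(osd_id, log):
    # Taking the same lock means a pending activate in another process
    # cannot interleave its "systemctl enable" with this stop.
    with file_lock(LOCK_PATH):
        log.append(("stop", osd_id))
```

Because both paths take the same lock, a deactivate waits for any
activate that already holds it instead of racing with it.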
Loic Dachary [Fri, 18 Dec 2015 16:03:21 +0000 (17:03 +0100)]
ceph-disk: use blkid instead of sgdisk -i
sgdisk -i 1 /dev/vdb opens /dev/vdb in write mode, which indirectly
triggers a BLKRRPART ioctl from udev (version 214 and later) when
the device is closed (see below for the udev release note). The
implementation of this ioctl by the kernel (even old kernels) removes
all partitions and adds them again (similar to what partprobe does
explicitly).
The side effects of partitions disappearing while ceph-disk is running
are devastating.
sgdisk is replaced by blkid which only opens the device in read mode and
will not trigger this unexpected behavior.
The problem does not show on Ubuntu 14.04 because it is running udev <
214 but shows on CentOS 7 which is running udev > 214.
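
A read-only lookup with blkid could look like the sketch below. It
assumes that blkid -o udev -p prints KEY=VALUE lines and that the
partition UUID appears as ID_PART_ENTRY_UUID; the function names are
hypothetical. Unlike sgdisk -i, blkid is given the partition device
(for example /dev/vdb1) rather than the disk plus a partition number.

```python
import subprocess


def parse_blkid_udev(output):
    """Turn KEY=VALUE lines, as printed by blkid -o udev, into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without "="
            info[key.strip()] = value.strip()
    return info


def partition_uuid(dev):
    """Return the partition UUID of dev, or None if blkid reports none.

    blkid only opens the device for reading, so closing it does not
    trigger the partition table rescan described above.
    """
    out = subprocess.check_output(
        ["blkid", "-o", "udev", "-p", dev], universal_newlines=True
    )
    return parse_blkid_udev(out).get("ID_PART_ENTRY_UUID")
```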
git clone git://anonscm.debian.org/pkg-systemd/systemd.git
systemd/NEWS:
CHANGES WITH 214:
* As an experimental feature, udev now tries to lock the
disk device node (flock(LOCK_SH|LOCK_NB)) while it
executes events for the disk or any of its partitions.
Applications like partitioning programs can lock the
disk device node (flock(LOCK_EX)) and claim temporary
device ownership that way; udev will entirely skip all event
handling for this disk and its partitions. If the disk
was opened for writing, the close will trigger a partition
table rescan in udev's "watch" facility, and if needed
synthesize "change" events for the disk and all its partitions.
This is now unconditionally enabled, and if it turns out to
cause major problems, we might turn it on only for specific
devices, or might need to disable it entirely. Device Mapper
devices are excluded from this logic.
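
The release note also describes a way for a program to claim a disk with
flock(LOCK_EX). The commit above does not take that route, but the
mechanism can be sketched as follows. The claim_device name is
hypothetical; the device is opened read-only so that closing it does not
itself trigger a rescan.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def claim_device(path):
    """Hold an exclusive flock on path while the block runs."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # While this is held, a flock(LOCK_SH | LOCK_NB) attempt on
        # another open of the same node fails instead of succeeding.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```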
Loic Dachary [Wed, 16 Dec 2015 14:57:03 +0000 (15:57 +0100)]
ceph-disk: dereference symlinks in destroy and zap
The behavior of partprobe or sgdisk may be subtly different if given a
symbolic link to a device instead of an actual device. The debug output
is also more confusing when the symlink shows instead of the device it
points to.
Always dereference the symlink before running destroy and zap.
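
Dereferencing can be done with os.path.realpath, as in this minimal
sketch (the dereference name is hypothetical):

```python
import os


def dereference(dev):
    """Resolve dev to the path it ultimately points to.

    A symlink such as one under /dev/disk/by-id is replaced by the actual
    device node; a path that is not a symlink is returned in its
    canonical form.
    """
    return os.path.realpath(dev)
```

Calling this once at the start of destroy and zap means every later
command and debug message sees the same device path.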
The default udevadm settle timeout of 120 seconds may be exceeded when
the disk is very slow, which can happen in cloud environments. Increase
it to 600 seconds instead.
The partprobe command may fail for the same reason but it does not have
a timeout parameter. Instead, try a few times before failing.
The udevadm settle calls guarding partprobe are not strictly necessary
because partprobe already does the same. However, partprobe does not
provide a way to control the timeout. Running one udevadm settle after
another is a noop most of the time and does not add any delay. It matters
when the udevadm settle run by partprobe fails with a timeout, because
partprobe silently ignores the failure.
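
A sketch of the retry and timeout logic described above. The helper
names, the number of attempts and the delay are assumptions chosen for
illustration; only the 600 second timeout comes from the text.

```python
import subprocess
import time


def run_with_retries(run, attempts=5, delay=1.0, sleep=time.sleep):
    """Call run() until it succeeds, at most attempts (>= 1) times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return run()
        except subprocess.CalledProcessError as error:
            last_error = error
            if attempt + 1 < attempts:
                sleep(delay)  # give a slow disk time before retrying
    raise last_error


def update_partition_table(dev):
    """Settle udev, run partprobe with retries, then settle again."""
    settle = ["udevadm", "settle", "--timeout=600"]
    subprocess.check_call(settle)
    run_with_retries(lambda: subprocess.check_call(["partprobe", dev]))
    # partprobe ignores a timeout of its own settle, so check explicitly.
    subprocess.check_call(settle)
```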
Yongqiang He [Thu, 17 Dec 2015 18:57:07 +0000 (13:57 -0500)]
mon: modify the level of a log about OSD's condition in OSDMonitor.cc
In actual use, we regularly replace old hard disks with new ones or do performance testing, and need to mark OSDs down manually. However, some OSDs cannot be marked down, and there is no obvious information explaining why.
The log { dout(5) << "can_mark_down current up_ratio " << up_ratio << " < min "<< g_conf->mon_osd_min_up_ratio<< ", will not mark osd." << i << "down" << dendl ; } can explain why it happened.
In addition, mon_osd_min_up_ratio can be changed to a value that is more reasonable for our operating environment, so it is necessary to lower the level of this log.
For example:
There are 6 OSDs, we have marked 5 of them down, and mon_osd_min_up_ratio = 0.3.
In this situation, when we mark the last OSD down, it shows "ceph-osd stop/waiting", but the OSD is actually still up.
Signed-off-by: Yongqiang He <he.yongqiang@h3c.com>
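
The example can be checked with a simplified model of the ratio test.
This sketch only reproduces the comparison shown in the log message; it
is not the OSDMonitor code, and the function signature is hypothetical.

```python
def can_mark_down(num_up, num_osds, min_up_ratio):
    """Return False when the current up ratio is below the minimum.

    Simplified model that assumes num_osds > 0.
    """
    up_ratio = float(num_up) / num_osds
    return up_ratio >= min_up_ratio


# 6 OSDs, 5 already down: 1/6 is about 0.17, which is below 0.3,
# so the last OSD is not marked down.
last_osd_allowed = can_mark_down(num_up=1, num_osds=6, min_up_ratio=0.3)
```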
Somnath Roy [Fri, 20 Nov 2015 03:06:17 +0000 (22:06 -0500)]
FileStore: Conditional collection of drive metadata
The get_device_by_uuid->blkid_find_dev_with_tag() call from
FileStore::collect_metadata() hangs for ~3min before returning
EINVAL when the drive is visible but reserved for some other host.
This is probably a bug within the blkid* calls. fdisk/lsblk return
immediately, saying the device is inaccessible. This call is now
protected by the config option filestore_collect_device_partition_information.
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
John Spray [Tue, 15 Dec 2015 17:19:30 +0000 (17:19 +0000)]
mon: add `osd blacklist clear`
This is just like 'blacklist rm' except it removes
everything. Useful if you've got a whole bunch of
things in your blacklist and you don't want to wait
for N "blacklist rm" commands to run.
wuxiangwei [Tue, 15 Dec 2015 13:28:36 +0000 (08:28 -0500)]
rbd: specify pool name for rbd admin socket commands
Add the pool name for a given rbd image when executing rbd admin socket
commands, in case there is more than one image with the same name in
different pools.
There was an accidental move of this line
when adding the MAY_SET_POOL check, which
was causing setxattr to proceed before
it had the right locks, and thereby apply
its checks on bad data (symptom was failing
to detect that the file had data written to it).
Fixes: #14029 Signed-off-by: John Spray <john.spray@redhat.com>
Piotr Dałek [Tue, 15 Dec 2015 13:01:51 +0000 (14:01 +0100)]
makefiles: remove bz2-dev from dependencies
The only thing that uses bzip2-devel is RocksDB, and it is optional, not
a requirement. Drop the bzip2-devel/libbz2-dev dependency entirely, and
let RocksDB use it only if it is already present.
Fixes: #13981 Signed-off-by: Piotr Dałek <piotr.dalek@ts.fujitsu.com>
Igor Podoski [Wed, 2 Dec 2015 06:25:00 +0000 (07:25 +0100)]
test/encoding/readable.sh: add non-whole type skip
With this modification it is possible to skip only one or a few
objects of a particular type, not the whole type as before.
Before:
- To skip a whole TYPE, create a file named TYPE in the forward_incompat
directory.
Now:
- To skip a whole TYPE, create a file or an empty directory named TYPE in
the forward_incompat directory.
- To skip one or a few objects of TYPE, create a directory named TYPE in
forward_incompat and put into it symbolic links to the objects that
you want to skip.
Signed-off-by: Igor Podoski <igor.podoski@ts.fujitsu.com>
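
The skip rules can be modelled with the sketch below. readable.sh is a
shell script, so this is only an illustration of the decision logic;
matching an object by the name of its symbolic link is an assumption,
and the should_skip name is hypothetical.

```python
import os


def should_skip(incompat_dir, type_name, object_name):
    """Decide whether object_name of type_name is skipped."""
    marker = os.path.join(incompat_dir, type_name)
    if os.path.isfile(marker):
        return True  # a file named TYPE skips the whole type
    if os.path.isdir(marker):
        entries = os.listdir(marker)
        if not entries:
            return True  # an empty directory also skips the whole type
        # Assumption: each link is named after the object it skips.
        return object_name in entries
    return False  # no marker: nothing is skipped
```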
Kefu Chai [Tue, 15 Dec 2015 04:28:13 +0000 (12:28 +0800)]
rgw: fix the build failure
* s/bucket_name_str/bucket_name/
* the member variable name of `req_state` was changed to `bucket_name`
in f7ca00a, but some commits were still using the old name before the
commit got merged.
* fixes for the tenant related change
* also fixes a typo