Loic Dachary [Thu, 10 Dec 2015 14:20:32 +0000 (15:20 +0100)]
tests: verify it is possible to reuse an OSD id
When an OSD id is removed via ceph osd rm, it will be reused by the next
ceph osd create command. Verify that an OSD reusing such an id
successfully comes up.
Loic Dachary [Tue, 5 Jan 2016 16:33:45 +0000 (17:33 +0100)]
ceph-disk: list accepts absolute dev names
The ceph-disk list subcommand now accepts /dev/sda as well as sda.
The filtering is done on the full list of devices instead of restricting
the number of devices explored. Always obtaining the full list of
devices makes things simpler when trying to match a dmcrypted device to
the corresponding raw device.
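The filtering described above can be sketched as follows. This is a minimal illustration, not the ceph-disk code: the helper names normalize and filter_devices and the dictionary layout of a device entry are hypothetical.

```python
def normalize(name):
    """Accept both 'sda' and '/dev/sda' and return the bare device name."""
    prefix = '/dev/'
    return name[len(prefix):] if name.startswith(prefix) else name


def filter_devices(all_devices, wanted):
    """Filter the full device list instead of limiting what is explored.

    all_devices: list of dicts with a 'name' key (hypothetical layout).
    wanted: names given on the command line, absolute or not.
    An empty selection keeps every device.
    """
    if not wanted:
        return list(all_devices)
    selected = {normalize(w) for w in wanted}
    return [d for d in all_devices if normalize(d['name']) in selected]
```

Because the full list is always built first, a later step that needs to look at every device (such as matching a dmcrypted device to its raw device) still has it available.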
Conflicts:
src/ceph-disk: as part of the implementation of deactivate /
destroy in master, the prototype of list_device was changed
to take a list of paths instead of all the arguments (args).
Loic Dachary [Tue, 5 Jan 2016 13:25:51 +0000 (14:25 +0100)]
ceph-disk: display OSD details when listing dmcrypt devices
The details about a device that is mapped via dmcrypt are directly
available. Do not try to fetch them from the device entry describing the
devicemapper entry.
Conflicts:
src/ceph-disk: an incorrect attempt was made to fix the same
problem. It was not backported and does not
need to be. It is entirely contained in the
code block removed and is the reason for the
conflict.
Loic Dachary [Tue, 5 Jan 2016 16:42:11 +0000 (17:42 +0100)]
ceph-disk: fix regression in cciss devices names
The cciss driver has device paths such as /dev/cciss/c0d1 with a
matching /sys/block/cciss!c0d1. The general case is that whenever a
device name is found in /sys/block, the / is replaced by the !.
When refactoring the ceph-disk list subcommand, this conversion was
overlooked in a few places. All explicit concatenations of /dev with a
device name are replaced with a call to get_dev_name which does the same
but also converts every ! into /.
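A minimal sketch of the / and ! conversion, assuming only what is described above. The two helper names are hypothetical and are not claimed to match the functions in ceph-disk.

```python
def sys_block_name_to_dev_path(name):
    """'cciss!c0d1' (as found in /sys/block) -> '/dev/cciss/c0d1'."""
    return '/dev/' + name.replace('!', '/')


def dev_path_to_sys_block_name(path):
    """'/dev/cciss/c0d1' -> 'cciss!c0d1' (the /sys/block entry name)."""
    if not path.startswith('/dev/'):
        raise ValueError('expected a path under /dev: %r' % path)
    return path[len('/dev/'):].replace('/', '!')
```

For ordinary names such as sda both helpers only add or strip the /dev/ prefix, so routing every conversion through one helper handles the cciss case without special-casing it at each call site.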
Loic Dachary [Thu, 7 Jan 2016 14:06:32 +0000 (15:06 +0100)]
Merge pull request #7001 from dachary/wip-14145-infernalis
infernalis: ceph-disk: use blkid instead of sgdisk -i
On CentOS 7.1 and other operating systems with a version of udev greater than or equal to 214,
running ceph-disk prepare triggered unexpected removal and addition of partitions on
the disk being prepared. That created problems ranging from the OSD not being activated
to failures because /dev/sdb1 does not exist although it should.
Loic Dachary [Wed, 6 Jan 2016 22:36:57 +0000 (23:36 +0100)]
tests: ceph-disk cryptsetup close must try harder
Similar to how it's done in dmcrypt_unmap in master ( 132e56615805cba0395898cf165b32b88600d633 ), the infernalis test helpers
that were deprecated by the addition of the deactivate / destroy
ceph-disk subcommands must try cryptsetup close a few times in some
contexts.
Loic Dachary [Fri, 18 Dec 2015 23:53:03 +0000 (00:53 +0100)]
ceph-disk: protect deactivate with activate lock
When ceph-disk prepares the disk, it triggers udev events and each of
them runs ceph-disk activate. If systemctl stop ceph-osd@2 happens while
there are still ceph-disk activate calls in flight, the systemctl stop may be
cancelled by the systemctl enable issued by one of the pending ceph-disk
activate calls.
This only matters in a test environment where disks are destroyed
shortly after they are activated.
Conflicts:
src/ceph-disk: ceph-disk deactivate does not exist in ceph-disk
on infernalis. But the same feature is implemented in
ceph-test-disk.py for test purposes and has the same
problem. The patch is adapted to ceph-test-disk.py.
Loic Dachary [Wed, 6 Jan 2016 10:15:19 +0000 (11:15 +0100)]
ceph-disk: retry cryptsetup remove
Retry a cryptsetup remove ten times. After the ceph-osd terminates, the
device is released asynchronously and an attempt to cryptsetup remove
may fail because it is considered busy. Although a few attempts are
made before giving up, the number of attempts / the duration of the
attempts cannot be controlled with a cryptsetup option. The workaround
is to increase this by trying a few times.
If cryptsetup remove fails for a reason that is unrelated to timeout,
the error will be repeated a few times. There is no undesirable side
effect. It will not hide a problem.
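The retry described above can be sketched as a small generic helper. This is an illustration under stated assumptions, not the ceph-disk implementation: the function name retry and the delay parameter are hypothetical, and only the count of ten attempts comes from the commit message.

```python
import time


def retry(action, attempts=10, delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    The last exception is re-raised when every attempt fails, so a
    failure unrelated to the device being busy is still reported.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```

For instance, retry(lambda: subprocess.check_call(['cryptsetup', 'remove', name])) would retry the removal, where name is a placeholder for the dmcrypt mapping name.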
Loic Dachary [Fri, 18 Dec 2015 16:03:21 +0000 (17:03 +0100)]
ceph-disk: use blkid instead of sgdisk -i
sgdisk -i 1 /dev/vdb opens /dev/vdb in write mode which indirectly
triggers a BLKRRPART ioctl from udev (version 214 and up) when
the device is closed (see below for the udev release note). The
implementation of this ioctl by the kernel (even old kernels) removes
all partitions and adds them again (similar to what partprobe does
explicitly).
The side effects of partitions disappearing while ceph-disk is running
are devastating.
sgdisk is replaced by blkid which only opens the device in read mode and
will not trigger this unexpected behavior.
The problem does not show on Ubuntu 14.04 because it is running udev <
214 but shows on CentOS 7 which is running udev > 214.
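A minimal sketch of reading partition information through blkid rather than sgdisk -i. It assumes that blkid -o udev -p prints KEY=VALUE lines and reports the partition type under the ID_PART_ENTRY_TYPE key; the function names are hypothetical and this is not the code from the patch.

```python
import subprocess


def parse_blkid_udev(output):
    """Turn the KEY=VALUE lines printed by 'blkid -o udev' into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition('=')
        if sep:
            info[key.strip()] = value.strip()
    return info


def partition_type(dev):
    """Return the partition type GUID of dev, or None if not reported.

    blkid only opens the device in read mode, unlike sgdisk -i.
    Assumes blkid reports the ID_PART_ENTRY_TYPE key for the partition.
    """
    out = subprocess.check_output(['blkid', '-o', 'udev', '-p', dev])
    return parse_blkid_udev(out.decode('utf-8')).get('ID_PART_ENTRY_TYPE')
```

Keeping the parsing separate from the subprocess call makes it possible to exercise the parser on captured output without touching a real device.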
git clone git://anonscm.debian.org/pkg-systemd/systemd.git
systemd/NEWS:
CHANGES WITH 214:
* As an experimental feature, udev now tries to lock the
disk device node (flock(LOCK_SH|LOCK_NB)) while it
executes events for the disk or any of its partitions.
Applications like partitioning programs can lock the
disk device node (flock(LOCK_EX)) and claim temporary
device ownership that way; udev will entirely skip all event
handling for this disk and its partitions. If the disk
was opened for writing, the close will trigger a partition
table rescan in udev's "watch" facility, and if needed
synthesize "change" events for the disk and all its partitions.
This is now unconditionally enabled, and if it turns out to
cause major problems, we might turn it on only for specific
devices, or might need to disable it entirely. Device Mapper
devices are excluded from this logic.
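The locking protocol in the release note above can be illustrated with Python's fcntl module. This sketch only demonstrates the flock semantics the note relies on; it is not what the ceph-disk patch does (the patch switches to blkid), and both function names are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_device_lock(path):
    """Hold flock(LOCK_EX) on path for the duration of the with block."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def try_shared_lock(path):
    """Attempt flock(LOCK_SH|LOCK_NB), as the note says udev does.

    Returns True if the shared lock was obtained, False if another
    open file description holds an exclusive lock.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

While a program is inside exclusive_device_lock, a non-blocking shared lock attempt on the same node fails immediately, which is the condition under which the note says udev skips event handling.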
Loic Dachary [Wed, 16 Dec 2015 14:57:03 +0000 (15:57 +0100)]
ceph-disk: dereference symlinks in destroy and zap
The behavior of partprobe or sgdisk may be subtly different if given a
symbolic link to a device instead of an actual device. The debug output
is also more confusing when the symlink shows instead of the device it
points to.
Always dereference the symlink before running destroy and zap.
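A minimal sketch of the dereferencing step, assuming os.path.realpath is an acceptable way to resolve the link. The helper name dereference is hypothetical.

```python
import os


def dereference(path):
    """Return the device a symlink points to, following every link.

    A path that is not a symlink is returned in canonical form.
    """
    return os.path.realpath(path)
```

Resolving the path once, before calling external tools, also means log messages show the actual device rather than the link name.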
The default udevadm settle timeout of 120 seconds may be exceeded when the
disk is very slow, which can happen in cloud environments. Increase it to
600 seconds instead.
The partprobe command may fail for the same reason but it does not have
a timeout parameter. Instead, try a few times before failing.
The udevadm settle calls guarding partprobe are not necessary because
partprobe already does the same. However, partprobe does not provide a
way to control the timeout. Having one udevadm settle after another is
going to be a noop most of the time and will not add any delay. It matters
when the udevadm settle run by partprobe fails with a timeout because
partprobe silently ignores the failure.
Conflicts:
qa/workunits/ceph-disk/ceph-disk-test.py:
trivial, because destroy/deactivate are not implemented
in infernalis. The existing destroy_osd function
has to be modified so the id returned by sh() does
not have a trailing newline.
Chengyuan Li [Fri, 20 Nov 2015 05:29:39 +0000 (22:29 -0700)]
mon/PGMonitor: MAX AVAIL is 0 if some OSDs' weight is 0
In get_rule_avail(), p->second may be used as a divisor even when it is
0. The quotient is then infinity, which becomes a negative value when
converted to an integer.
So we should check the value of p->second before the calculation.
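The guard can be sketched in Python as follows. This is an illustration only: the function name and the two dictionaries are hypothetical, and skipping zero-weight entries is an assumption about how the check is applied, not a transcription of the C++ fix. Note that Python raises ZeroDivisionError where C++ floating point yields infinity, but the remedy is the same.

```python
def min_projected_avail(weights, avail_bytes):
    """Smallest avail/weight ratio over OSDs with a non-zero weight.

    weights: {osd_id: weight share}, playing the role of p->second.
    avail_bytes: {osd_id: free bytes on that OSD}.
    Returns 0 when no OSD has a usable weight.
    """
    best = None
    for osd, weight in weights.items():
        if weight <= 0:
            continue  # check the divisor before calculating
        if osd not in avail_bytes:
            continue
        projected = int(avail_bytes[osd] / weight)
        if best is None or projected < best:
            best = projected
    return best if best is not None else 0
```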
Loic Dachary [Wed, 21 Oct 2015 22:21:49 +0000 (00:21 +0200)]
tests: ceph-disk workunit uses configobj
Instead of using augtool to modify the configuration file, use
configobj. It is also used by the install teuthology task. The .ini
lens (puppet lens really) is unable to read ini files created by
configobj.
Sage Weil [Wed, 2 Dec 2015 19:50:28 +0000 (14:50 -0500)]
osd: call on_new_interval on newly split child PG
We must call on_new_interval() on any interval change *and* on the
creation of the PG. Currently we call it from PG::init() and
PG::start_peering_interval(). However, PG::split_into() did not
do so for the child PG, which meant that the new child feature
bits were not properly initialized and the bitwise/nibblewise
debug bit was not correctly set. That, in turn, could lead to
various misbehaviors, the most obvious of which is scrub errors
due to the sort order mismatch.
xiexingguo [Mon, 2 Nov 2015 13:46:11 +0000 (21:46 +0800)]
Objecter: remove redundant result-check of _calc_target in _map_session.
Result-code check is currently redundant since _calc_target never returns a negative value.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 5a6117e667024f51e65847f73f7589467b6cb762)
xiexingguo [Thu, 29 Oct 2015 09:32:50 +0000 (17:32 +0800)]
Objecter: potential null pointer access when do pool_snap_list.
Check pool existence first.
Fixes: #13639
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 865541605b6c32f03e188ec33d079b44be42fa4a)
Loic Dachary [Mon, 2 Nov 2015 23:21:51 +0000 (00:21 +0100)]
tests: test/librados/test.cc must create profile
Now that the create_one_ec_pool function removes the testprofile each
time it is called, it must create the testprofile erasure code profile
again for the test to use.
Loic Dachary [Mon, 2 Nov 2015 19:24:51 +0000 (20:24 +0100)]
tests: destroy testprofile before creating one
The testprofile erasure code profile is destroyed before creating a new
one so that it does not fail when another testprofile erasure code
profile already exists with different parameters.
This must be done when creating erasure coded pools with the C++
interface, in the same way it's done with the C interface.
Boris Ranto [Fri, 30 Oct 2015 17:33:36 +0000 (18:33 +0100)]
rbdmap: Move do_map and do_unmap shell functions to rbdmap script
This patch creates an rbdmap shell script that is called from the
init-rbdmap init script. The patch also renames the src/rbdmap
configuration file to src/etc-rbdmap so that the rbdmap shell script can
be installed via the build system directly. Finally, the patch
accommodates these changes in the spec file and build system.
Sage Weil [Fri, 18 Sep 2015 01:42:53 +0000 (21:42 -0400)]
osd: fix send_failures() locking
It is unsafe to check failure_queue.empty() without the lock.
Fixes: #13869
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit b3ca828ae8ebc9068073494c46faf3e8e1443ada)
Herve Rousseau [Fri, 6 Nov 2015 08:52:28 +0000 (09:52 +0100)]
rgw: fix reload on non Debian systems.
When using reload on non-Debian systems, /bin/sh's kill is used to send the HUP signal to the radosgw process.
This version of kill doesn't understand -SIGHUP as a valid signal; using -HUP does work.
Jason Dillaman [Tue, 7 Jul 2015 16:11:13 +0000 (12:11 -0400)]
WorkQueue: new PointerWQ base class for ContextWQ
The existing work queues do not properly function if added to a running
thread pool. librbd uses a singleton thread pool which requires
dynamically adding/removing work queues as images are opened and closed.
Jason Dillaman [Mon, 9 Nov 2015 16:22:24 +0000 (11:22 -0500)]
librbd: fixed deadlock while attempting to flush AIO requests
In-flight AIO requests might force a flush if a snapshot was created
out-of-band. The flush completion was previously invoked asynchronously,
potentially via the same thread worker handling the AIO request. This
resulted in the flush operation deadlocking since it can't complete.
Fixes: #13726
Backport: infernalis, hammer
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit bfeb90e5fe24347648c72345881fd3d932243c98)
xiexingguo [Thu, 29 Oct 2015 12:04:11 +0000 (20:04 +0800)]
Objecter: pool_op callback may hang forever.
pool_op callback may hang forever due to osdmap update during reply handling.
Fixes: #13642
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 00c6fa9e31975a935ed2bb33a099e2b4f02ad7f2)
Sage Weil [Wed, 28 Oct 2015 00:55:26 +0000 (20:55 -0400)]
crush/mapper: ensure bucket id is valid before indexing buckets array
We were indexing the buckets array without verifying the index was within
the [0,max_buckets) range. This could happen because a multistep rule
does not have enough buckets and has CRUSH_ITEM_NONE
for an intermediate result, which would feed in CRUSH_ITEM_NONE and
make us crash.
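The range check can be sketched in Python as follows. It assumes the usual CRUSH convention that bucket ids are negative and map to array slot -1 - id, and that CRUSH_ITEM_NONE is 0x7fffffff; the function name bucket_index is hypothetical and this is not the mapper code itself.

```python
CRUSH_ITEM_NONE = 0x7fffffff  # marker for "no item" in a CRUSH result


def bucket_index(item, max_buckets):
    """Map a bucket id to its slot in the buckets array, or None.

    Bucket ids are negative: id -1 maps to slot 0, -2 to slot 1, etc.
    Anything outside [0, max_buckets) is rejected instead of indexed.
    """
    bno = -1 - item
    if bno < 0 or bno >= max_buckets:
        return None  # e.g. CRUSH_ITEM_NONE or a device id (>= 0)
    return bno
```

A caller would skip the item when None is returned rather than dereferencing the buckets array.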
xiexingguo [Tue, 13 Oct 2015 06:04:20 +0000 (14:04 +0800)]
OSD:shall reset primary and up_primary fields when beginning a new past_interval.
Shall reset primary and up_primary fields when we start over a new past_interval in OSD::build_past_intervals_parallel().
Fixes: #13471
Signed-off-by: xie.xingguo@zte.com.cn
(cherry picked from commit 65064ca05bc7f8b6ef424806d1fd14b87add62a4)
Sage Weil [Fri, 23 Oct 2015 17:27:39 +0000 (13:27 -0400)]
osd: fix OSDService vs Objecter init order
This reverts c7d96a5ed1d2cb844622af29b13705b8f7be6be7, but still keeps
the Objecter init *after* we have authenticated. This way we don't
crash when we get mon messages like MOSDPGCreate, and we also don't
request maps we aren't prepared to handle.
Boris Ranto [Fri, 23 Oct 2015 14:39:16 +0000 (16:39 +0200)]
ceph.spec.in: We no longer need redhat-lsb-core
Drop the redhat-lsb-core dependency as it is no longer necessary on
fedora/rhel.
The other two init scripts do not use redhat-lsb-core either. The
init-ceph.in conditionally requires /lib/lsb/init-functions and does not
use any of the functions defined in that file (at least not directly).
The init-radosgw file includes /etc/rc.d/init.d/functions on non-debian
platforms instead of /lib/lsb/init-functions file so it does not require
redhat-lsb-core either.
Boris Ranto [Fri, 23 Oct 2015 13:31:27 +0000 (15:31 +0200)]
init-rbdmap: Rewrite to use logger + clean-up
This patch rewrites the init-rbdmap init script so that it uses logger
instead of the log_* functions. The patch also fixes various smaller
bugs like:
* MAP_RV was undefined if mapping already existed
* UMNT_RV and UMAP_RV were almost always empty (if they succeeded) ->
removed them
* use of continue instead of RET_OP in various places (RET_OP was not being
checked after the switch to logger messages)
* removed use of DESC (used only twice and only one occurrence actually
made sense)
Jason Dillaman [Wed, 21 Oct 2015 17:12:48 +0000 (13:12 -0400)]
librbd: potential assertion failure during cache read
It's possible for a cache read from a clone to trigger a writeback if a
previous read op determined the object doesn't exist in the clone,
followed by a cached write to the non-existent clone object, followed
by another read request to the same object. This causes the cache to
flush the pending writeback ops while not holding the owner lock.
Fixes: #13559
Backport: hammer
Signed-off-by: Jason Dillaman <dillaman@redhat.com>