Jason Dillaman [Wed, 9 Mar 2016 23:00:04 +0000 (18:00 -0500)]
librbd: complete cache reads on cache's dedicated thread
If a snapshot is created out-of-band, the next IO will result in the
cache being flushed. If pending writeback data performs a copy-on-write,
the read from the parent will be blocked.
Fixes: #15032
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Fri, 14 Aug 2015 17:28:13 +0000 (13:28 -0400)]
WorkQueue: PointerWQ drain no longer waits for other queues
If another (independent) queue was processing, drain could
block waiting. Instead, allow drain to exit quickly if
no items are being processed and the queue is empty for
the current WQ.
John Spray [Mon, 16 Nov 2015 10:57:56 +0000 (10:57 +0000)]
mon: don't require OSD W for MRemoveSnaps
Use the ability to execute the "osd pool rmsnap" command
as a signal that the client should be permitted
to send MRemoveSnaps too.
Note that we don't also require the W ability,
unlike Monitor::_allowed_command -- this is slightly
more permissive handling, but anyone crafting caps
that explicitly permit "osd pool rmsnap" needs to
know what they are doing.
Greg Farnum [Wed, 13 Jan 2016 21:17:53 +0000 (13:17 -0800)]
fsx: checkout old version until it compiles properly on miras
I sent a patch to xfstests upstream at
http://article.gmane.org/gmane.comp.file-systems.fstests/1665, but
until that's fixed we need a version that works in our test lab.
Douglas Fuller [Thu, 7 Jan 2016 19:01:19 +0000 (11:01 -0800)]
cls_rbd: enable object map checksums for object_map_save
object_map_save disables CRCs when an object map footer isn't provided.
Unconditionally re-enable object map CRCs before re-encoding the new object
map.
Douglas Fuller [Fri, 22 Jan 2016 19:18:40 +0000 (11:18 -0800)]
rbd: remove canceled tasks from timer thread
When canceling scheduled tasks using the timer thread, TaskFinisher::cancel
does not call SafeTimer::cancel_event, so events fire anyway. Add this call.
Loic Dachary [Thu, 10 Dec 2015 14:20:32 +0000 (15:20 +0100)]
tests: verify it is possible to reuse an OSD id
When an OSD id is removed via ceph osd rm, it will be reused by the next
ceph osd create command. Verify that an OSD reusing such an id
successfully comes up.
Loic Dachary [Tue, 5 Jan 2016 16:33:45 +0000 (17:33 +0100)]
ceph-disk: list accepts absolute dev names
The ceph-disk list subcommand now accepts /dev/sda as well as sda.
The filtering is done on the full list of devices instead of restricting
the number of devices explored. Always obtaining the full list of
devices makes things simpler when trying to match a dmcrypted device to
the corresponding raw device.
Conflicts:
src/ceph-disk: as part of the implementation of deactivate /
destroy in master, the prototype of list_device was changed
to take a list of paths instead of all the arguments (args).
Loic Dachary [Tue, 5 Jan 2016 13:25:51 +0000 (14:25 +0100)]
ceph-disk: display OSD details when listing dmcrypt devices
The details about a device that is mapped via dmcrypt are directly
available. Do not try to fetch them from the device entry describing the
devicemapper entry.
Conflicts:
src/ceph-disk: an incorrect attempt was made to fix the same
problem. It was not backported and does not
need to be. It is entirely contained in the
code block removed and is the reason for the
conflict.
Loic Dachary [Tue, 5 Jan 2016 16:42:11 +0000 (17:42 +0100)]
ceph-disk: fix regression in cciss device names
The cciss driver has device paths such as /dev/cciss/c0d1 with a
matching /sys/block/cciss!c0d1. The general case is that whenever a
device name is found in /sys/block, the / is replaced by the !.
When refactoring the ceph-disk list subcommand, this conversion was
overlooked in a few places. All explicit concatenations of /dev with a
device name are replaced with a call to get_dev_name, which does the same
but also converts every ! to /.
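The name/path conversion described above can be sketched as follows. This is a minimal illustration, not the ceph-disk code: the helper names dev_name_from_path and dev_path_from_name are hypothetical and chosen only for this example.

```python
def dev_name_from_path(path):
    """Map a device path to its /sys/block style name.

    Hypothetical helper: every '/' after the /dev/ prefix becomes '!'.
    """
    prefix = "/dev/"
    if not path.startswith(prefix):
        raise ValueError("expected a path under /dev: %r" % path)
    return path[len(prefix):].replace("/", "!")


def dev_path_from_name(name):
    """Map a /sys/block style name back to a path under /dev.

    Hypothetical helper: every '!' becomes '/', so cciss!c0d1 maps to
    /dev/cciss/c0d1 instead of the wrong /dev/cciss!c0d1.
    """
    return "/dev/" + name.replace("!", "/")
```

Plain concatenation ("/dev/" + name) gives the same result for names such as sda, which is presumably why the missing conversion only showed up with drivers like cciss.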
Loic Dachary [Thu, 7 Jan 2016 14:06:32 +0000 (15:06 +0100)]
Merge pull request #7001 from dachary/wip-14145-infernalis
infernalis: ceph-disk: use blkid instead of sgdisk -i
On CentOS 7.1 and other operating systems with a version of udev greater than or equal to 214,
running ceph-disk prepare triggered unexpected removal and addition of partitions on
the disk being prepared. That created problems ranging from the OSD not being activated
to failures because /dev/sdb1 does not exist although it should.
Loic Dachary [Wed, 6 Jan 2016 22:36:57 +0000 (23:36 +0100)]
tests: ceph-disk cryptsetup close must try harder
Similar to how it's done in dmcrypt_unmap in master (132e56615805cba0395898cf165b32b88600d633), the infernalis test helpers
that were deprecated by the addition of the deactivate / destroy
ceph-disk subcommands must try cryptsetup close a few times in some
contexts.
Loic Dachary [Fri, 18 Dec 2015 23:53:03 +0000 (00:53 +0100)]
ceph-disk: protect deactivate with activate lock
When ceph-disk prepares the disk, it triggers udev events, each of
which runs ceph-disk activate. If systemctl stop ceph-osd@2 happens while
ceph-disk activate calls are still in flight, the systemctl stop may be
cancelled by the systemctl enable issued by one of the pending ceph-disk
activate calls.
This only matters in a test environment where disks are destroyed
shortly after they are activated.
src/ceph-disk: ceph-disk deactivate does not exist in ceph-disk
on infernalis. But the same feature is implemented in
ceph-test-disk.py for test purposes and has the same
problem. The patch is adapted to ceph-test-disk.py.
Loic Dachary [Wed, 6 Jan 2016 10:15:19 +0000 (11:15 +0100)]
ceph-disk: retry cryptsetup remove
Retry cryptsetup remove up to ten times. After the ceph-osd terminates, the
device is released asynchronously and an attempt to cryptsetup remove
may fail because it is considered busy. Although a few attempts are
made before giving up, the number of attempts / the duration of the
attempts cannot be controlled with a cryptsetup option. The workaround
is to increase this by trying a few times.
If cryptsetup remove fails for a reason that is unrelated to timeout,
the error will be repeated a few times. There is no undesirable side
effect. It will not hide a problem.
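A retry loop of the kind described above might look like the sketch below. It assumes a one second pause between attempts, which the commit message does not specify, and the function name retry_command and its parameters are hypothetical rather than taken from ceph-disk.

```python
import subprocess
import time


def retry_command(cmd, attempts=10, delay=1.0, run=subprocess.call):
    """Run cmd until it exits with status 0 or attempts are exhausted.

    Returns the exit status of the last attempt. The run parameter exists
    only so the loop can be exercised without a real command.
    """
    status = None
    for attempt in range(attempts):
        status = run(cmd)
        if status == 0:
            break
        # Do not sleep after the final attempt.
        if attempt + 1 < attempts:
            time.sleep(delay)
    return status
```

It could be called as retry_command(["cryptsetup", "remove", name]), where name is the mapping to tear down. A failure unrelated to the device being busy simply repeats on every attempt and is still reported through the returned status.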
Loic Dachary [Fri, 18 Dec 2015 16:03:21 +0000 (17:03 +0100)]
ceph-disk: use blkid instead of sgdisk -i
sgdisk -i 1 /dev/vdb opens /dev/vdb in write mode which indirectly
triggers a BLKRRPART ioctl from udev (version 214 and later) when
the device is closed (see below for the udev release note). The
implementation of this ioctl by the kernel (even old kernels) removes
all partitions and adds them again (similar to what partprobe does
explicitly).
The side effects of partitions disappearing while ceph-disk is running
are devastating.
sgdisk is replaced by blkid which only opens the device in read mode and
will not trigger this unexpected behavior.
The problem does not show on Ubuntu 14.04 because it is running udev <
214 but shows on CentOS 7 which is running udev > 214.
git clone git://anonscm.debian.org/pkg-systemd/systemd.git
systemd/NEWS:
CHANGES WITH 214:
* As an experimental feature, udev now tries to lock the
disk device node (flock(LOCK_SH|LOCK_NB)) while it
executes events for the disk or any of its partitions.
Applications like partitioning programs can lock the
disk device node (flock(LOCK_EX)) and claim temporary
device ownership that way; udev will entirely skip all event
handling for this disk and its partitions. If the disk
was opened for writing, the close will trigger a partition
table rescan in udev's "watch" facility, and if needed
synthesize "change" events for the disk and all its partitions.
This is now unconditionally enabled, and if it turns out to
cause major problems, we might turn it on only for specific
devices, or might need to disable it entirely. Device Mapper
devices are excluded from this logic.
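As an illustration of the blkid replacement described in this commit, the sketch below reads partition details through blkid's low-level probe and parses its KEY=VALUE output. Whether ceph-disk uses exactly this invocation is an assumption, the function names are hypothetical, and the sample values in the usage note are made up.

```python
import subprocess


def parse_blkid_udev(output):
    """Parse KEY=VALUE lines as printed by `blkid -o udev -p <device>`."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that carry no KEY=VALUE pair
            info[key.strip()] = value.strip()
    return info


def probe_partition(dev):
    """Probe dev with blkid, which opens the device read-only.

    Assumption: the partition is probed with -p (low-level probing) and
    -o udev (KEY=VALUE output).
    """
    out = subprocess.check_output(["blkid", "-o", "udev", "-p", dev])
    return parse_blkid_udev(out.decode("utf-8"))
```

For a partition on a GPT disk, the parsed dictionary would be expected to contain keys such as ID_PART_ENTRY_UUID and ID_PART_ENTRY_TYPE, for example parse_blkid_udev("ID_PART_ENTRY_SCHEME=gpt\nID_PART_ENTRY_UUID=made-up-uuid\n").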
Loic Dachary [Wed, 16 Dec 2015 14:57:03 +0000 (15:57 +0100)]
ceph-disk: dereference symlinks in destroy and zap
The behavior of partprobe or sgdisk may be subtly different if given a
symbolic link to a device instead of an actual device. The debug output
is also more confusing when the symlink shows instead of the device it
points to.
Always dereference the symlink before running destroy and zap.
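Dereferencing can be done with the standard library alone. The sketch below is an assumption about the approach, not the ceph-disk code, and the helper name resolve_device is hypothetical.

```python
import os


def resolve_device(path):
    """Return the canonical path for path, following any symlinks.

    A path that is not a symlink comes back unchanged apart from
    normalization, so the call is safe for both cases.
    """
    return os.path.realpath(path)
```

Passing the resolved path to partprobe or sgdisk, and printing it in debug output, means the tools and the logs always refer to the actual device rather than to a link such as one under /dev/disk/by-id.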
The default udevadm settle timeout of 120 seconds may be exceeded when the
disk is very slow, which can happen in cloud environments. Increase it to
600 seconds instead.
The partprobe command may fail for the same reason but it does not have
a timeout parameter. Instead, try a few times before failing.
The udevadm settle calls guarding partprobe are not necessary because
partprobe already does the same. However, partprobe does not provide a
way to control the timeout. Running one udevadm settle after another is
going to be a noop most of the time and will not add any delay. It matters
when the udevadm settle run by partprobe fails with a timeout, because
partprobe silently ignores the failure.
Conflicts:
qa/workunits/ceph-disk/ceph-disk-test.py:
trivial, because destroy/deactivate are not implemented
in infernalis. The existing destroy_osd function
has to be modified so the id returned by sh() does
not have a trailing newline.
Chengyuan Li [Fri, 20 Nov 2015 05:29:39 +0000 (22:29 -0700)]
mon/PGMonitor: MAX AVAIL is 0 if some OSDs' weight is 0
In get_rule_avail(), p->second may be used as a divisor even when it is 0.
The quotient is then infinity, which becomes a negative value when it is
converted to an integer.
So we should check the value of p->second before the calculation.
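The guard can be illustrated with a small Python sketch. It is not the PGMonitor code: the function name, the two dictionaries and the choice to skip zero-weight OSDs are assumptions made for the example. Python raises ZeroDivisionError where C++ floating point yields infinity, but the check is the same.

```python
def min_projected_avail(weights, avail_kb):
    """Return the smallest avail / weight over OSDs with a non-zero weight.

    weights:  {osd_id: weight fraction}  (hypothetical layout)
    avail_kb: {osd_id: available KB}     (hypothetical layout)
    """
    best = None
    for osd, weight in weights.items():
        # Check the divisor first: a zero-weight OSD is skipped.
        if weight == 0 or osd not in avail_kb:
            continue
        projected = int(avail_kb[osd] / weight)
        if best is None or projected < best:
            best = projected
    return 0 if best is None else best
```

With weights {0: 0.5, 1: 0.0, 2: 0.5} and avail_kb {0: 100, 1: 100, 2: 200}, OSD 1 is ignored and the result comes from OSD 0.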
Loic Dachary [Wed, 21 Oct 2015 22:21:49 +0000 (00:21 +0200)]
tests: ceph-disk workunit uses configobj
Instead of using augtool to modify the configuration file, use
configobj. It is also used by the install teuthology task. The .ini
lens (puppet lens really) is unable to read ini files created by
configobj.
Yan, Zheng [Mon, 9 Nov 2015 03:37:02 +0000 (11:37 +0800)]
client: use null snapc to check pool permission
snap inodes' ->snaprealm can be NULL, so dereferencing it in
check_pool_perm() can cause a segmentation fault. The pool permission
check does not write any data, so it's safe to use null snapc.
Sage Weil [Wed, 2 Dec 2015 19:50:28 +0000 (14:50 -0500)]
osd: call on_new_interval on newly split child PG
We must call on_new_interval() on any interval change *and* on the
creation of the PG. Currently we call it from PG::init() and
PG::start_peering_interval(). However, PG::split_into() did not
do so for the child PG, which meant that the new child feature
bits were not properly initialized and the bitwise/nibblewise
debug bit was not correctly set. That, in turn, could lead to
various misbehaviors, the most obvious of which is scrub errors
due to the sort order mismatch.
xiexingguo [Mon, 2 Nov 2015 13:46:11 +0000 (21:46 +0800)]
Objecter: remove redundant result-check of _calc_target in _map_session.
Result-code check is currently redundant since _calc_target never returns a negative value.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 5a6117e667024f51e65847f73f7589467b6cb762)
xiexingguo [Thu, 29 Oct 2015 09:32:50 +0000 (17:32 +0800)]
Objecter: potential null pointer access when do pool_snap_list.
Check that the pool exists first.
Fixes: #13639
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 865541605b6c32f03e188ec33d079b44be42fa4a)