Kefu Chai [Sat, 7 Oct 2017 14:15:11 +0000 (22:15 +0800)]
ceph-disk: retry on OSError
we are likely to
1) create partition, for instance, sdc1
2) partprobe sdc
3) udevadm settle
4) check the device by its path: /dev/sdc1
but there is a chance that the uevent sent from the kernel fails to reach udev
before we call "udevadm", hence "/dev/sdc1" does not exist even after
"udevadm settle" returns. So we retry in case of OSError here.
Ken Dreyer [Thu, 7 Sep 2017 17:07:59 +0000 (11:07 -0600)]
.gitignore: allow debian .patch files
The Ubuntu packaging layout with git-buildpackage assumes a
"debian/patches/" directory with several .patch files in it.
When upstream's .gitignore tells Git to ignore .patch files, we have to
edit that line out downstream. When we forget to do that downstream, it
can lead to missing patches and broken downstream builds.
Allow patches in the /debian/patches directory so it's easier to
maintain an Ubuntu package based on upstream's Git repo.
Sage Weil [Thu, 5 Oct 2017 20:26:16 +0000 (15:26 -0500)]
src/messages/MOSDMap: reencode OSDMap for older clients
We explicitly select which missing bits trigger a reencode. We
already had jewel and earlier covered, but kraken includes all of
the previously mentioned bits but not SERVER_LUMINOUS. This
prevents kraken clients from decoding luminous maps.
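A minimal sketch of the idea, assuming made-up bit values: the reencode decision is a mask of "bits we care about", and a peer missing any of them gets a reencoded map. The constants and the needs_reencode helper below are hypothetical and do not reflect the real feature bit layout.

```python
# Hypothetical feature bits, for illustration only.
SERVER_JEWEL = 1 << 0
SERVER_KRAKEN = 1 << 1
SERVER_LUMINOUS = 1 << 2

# Mask before the fix (assumed): luminous is not listed.
OLD_MASK = SERVER_JEWEL | SERVER_KRAKEN
# Mask after the fix (assumed): a missing SERVER_LUMINOUS also triggers.
NEW_MASK = OLD_MASK | SERVER_LUMINOUS


def needs_reencode(peer_features, mask):
    """True if the peer lacks any feature bit listed in mask."""
    return (mask & ~peer_features) != 0


kraken_client = SERVER_JEWEL | SERVER_KRAKEN
luminous_client = kraken_client | SERVER_LUMINOUS
```

With these assumed values, a kraken client passes the old mask untouched (no reencode) but is caught by the new one, while a luminous client is never reencoded.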
Kefu Chai [Thu, 31 Aug 2017 10:15:28 +0000 (18:15 +0800)]
cmake: disable VTA on options.cc
to silence the following warning and to avoid compiling this file twice:
ceph/src/common/options.cc: In function ‘std::vector<Option> get_global_options()’:
ceph/src/common/options.cc:151:21: note: variable tracking
size limit exceeded with -fvar-tracking-assignments, retrying without
std::vector<Option> get_global_options() {
^~~~~~~~~~~~~~~~~~
Sage Weil [Wed, 4 Oct 2017 20:28:26 +0000 (15:28 -0500)]
osd/PG: separate event for RemoteReservationCanceled
Right now we transparently map a RemoteReservationRejected into a
*Canceled event because this is what peers send over the wire. Even
once new peers start sending an explicit CANCEL, old peers will
still do so, so we'll maintain this mapping for a while.
Sage Weil [Sun, 1 Oct 2017 20:04:34 +0000 (15:04 -0500)]
osd/PG: handle RecoveryReservationRejected in RepWaitRecoveryReserved
This state is analogous to RepWaitBackfillReserved; just like we do there
we want to handle the REJECT from the primary by canceling our local
remote_reservation.
Sage Weil [Sun, 1 Oct 2017 20:03:22 +0000 (15:03 -0500)]
osd/PG: ignore RemoteReservationRejected if we are RepNotRecovering
The primary may send us a REJECT (meaning cancel) if recovery/backfill is
preempted there. That can happen even if the recovery isn't reserved or
requested here (e.g., because the primary is still waiting for the local
reservation). Just ignore it and remain in RepNotRecovering.
Sage Weil [Sun, 1 Oct 2017 20:01:05 +0000 (15:01 -0500)]
osd/PG: cancel local reservation in RemoteReservationRejected handler
We can get a RemoteReservationRejected event either because *we* decide
to reject, or because we get a REJECT from the primary that means "cancel"
(e.g., because recovery/backfill was preempted there). In both cases we
want to cancel our remote_reservation.
Sage Weil [Sun, 1 Oct 2017 19:59:31 +0000 (14:59 -0500)]
osd/PG: move reject_reservation out of RemoteReservationRejected reaction
The RemoteReservationRejected event is also submitted when we are a
replica or backfill target and get a MBackfillReserve REJECT message
because the primary canceled or was preempted. In that case, we don't
want to send a REJECT back to the primary; we only need to send it in the
cases where *we*, locally, decide to reject. Move the call to those call
sites.
osd: make the PG's SORTBITWISE assert a more generous shutdown
We want to stop working if we get activated while sortbitwise is not set
on the cluster, but we might have old maps where it wasn't set if the flag
was changed recently. And doing it in the PG code was a bit silly anyway.
Instead check SORTBITWISE in the main OSDMap handling code prior to
prepublishing it. Let it go through if we aren't active at the time.
Add _interfaces option to constrain the choice of IPs in the network
list to those on interfaces matching the provided list of interface names.
The _interfaces options only work in concert with the _network options,
so you must also specify a list of networks if you want to use a specific
interface, e.g., by specifying a broad network like "::" or "0.0.0.0/0".
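The selection rule can be sketched like this, assuming the candidate addresses are already known as (interface, ip) pairs. The pick_addr function and its input shape are hypothetical; a real daemon would enumerate interfaces from the OS.

```python
import ipaddress


def pick_addr(candidates, networks, interfaces=None):
    """Return the first IP inside one of `networks`, optionally limited
    to interfaces whose name appears in `interfaces`.

    candidates: list of (interface_name, ip_string) pairs (assumed input).
    """
    # Note: Python's ipaddress wants an explicit prefix for a broad
    # network, e.g. "0.0.0.0/0" or "::/0".
    nets = [ipaddress.ip_network(n) for n in networks]
    for ifname, ip in candidates:
        if interfaces and ifname not in interfaces:
            continue  # interface filter only narrows the network match
        addr = ipaddress.ip_address(ip)
        if any(addr.version == net.version and addr in net for net in nets):
            return ip
    return None
```

For example, with candidates [("eth0", "10.0.0.5"), ("eth1", "192.168.1.7")], passing networks ["0.0.0.0/0"] and interfaces ["eth1"] selects the eth1 address, which mirrors the "broad network plus interface name" usage described above.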
Greg Farnum [Tue, 3 Oct 2017 22:54:06 +0000 (15:54 -0700)]
msgr: add a mechanism for Solaris to avoid dying on SIGPIPE
This is fairly clean: we define an RAII object in Messenger.h on
Solaris, and "declare" it with a macro in the implementations. There's
no code duplication and on Linux it's just entirely compiled out.
Sage Weil [Tue, 3 Oct 2017 21:48:37 +0000 (16:48 -0500)]
os/bluestore: use normal Context for async deferred_try_submit
I'm not quite sure why the FunctionContext did not ever execute on the
finisher thread (perhaps the [&] captured some state on the stack that it
shouldn't have?). In any case, using a traditional Context here appears
to resolve the problem (of the async deferred_try_submit() never executing,
leading to a bluestore stall/deadlock).
Sage Weil [Fri, 29 Sep 2017 18:47:19 +0000 (13:47 -0500)]
os/bluestore: wake kv thread when blocking on deferred_bytes
We need to wake the kv thread whenever setting deferred_aggressive to
ensure that txns with deferred io that have committed but haven't submitted
their deferred writes get submitted. This aligns us with the other
users of deferred_aggressive (e.g., _osr_drain_all).
Sage Weil [Wed, 4 Oct 2017 13:25:38 +0000 (08:25 -0500)]
mgr/localpool: fix rule selection
The 'osd pool create' arg parsing is broken; the rule name for
'ceph osd pool create $name $numpgs replicated $rulename' is passed
via the erasure_code_profile param. Too many req=false options
without a way to disambiguate them.
Work around it by passing both 'rule' and 'erasure_code_profile'
keys, so that if/when the hack in OSDMonitor.cc is removed it will
still work. Blech.
Sage Weil [Tue, 19 Sep 2017 20:26:40 +0000 (15:26 -0500)]
osd/PG: allow local recovery reservations to be preempted
If a PG has a higher recovery priority and a lower-priority item is in
progress, allow it to be preempted. This triggers the RecoveryCancel
or BackfillCancel event with a 0 delay, which means it will immediately
re-request a reservation (and presumably wait).
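A toy model of the preemption behaviour, assuming a single reservation slot. The Reserver class below is a hypothetical stand-in and is not the OSD's actual reserver; the callback plays the role of the cancel event that re-requests with zero delay.

```python
class Reserver:
    """Single-slot reserver where a higher-priority request preempts."""

    def __init__(self):
        self.holder = None   # (priority, name, on_preempt) or None
        self.waiting = []    # queued (priority, name, on_preempt) items

    def request(self, name, priority, on_preempt):
        """Return True if the reservation is granted immediately."""
        item = (priority, name, on_preempt)
        if self.holder is None:
            self.holder = item
            return True
        if priority > self.holder[0]:
            old = self.holder
            self.holder = item
            old[2]()         # tell the old holder it was preempted
            return True
        self.waiting.append(item)
        return False
```

If the preempted holder's callback immediately calls request() again with its own lower priority, it lands in the waiting list behind the new holder, which matches the "re-request and presumably wait" outcome.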
mon/MgrMonitor: read cmd descs if empty on update_from_paxos()
If the MgrMonitor's `command_descs` is empty, the monitor will not send
the mgr commands to clients on `get_descriptions`. This, in turn, has
the clients sending the commands to the monitors, which will have no
idea how to handle them.
Therefore, make sure to read the `command_descs` from disk if the vector
is empty.
Fixes: http://tracker.ceph.com/issues/21300
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
(cherry picked from commit 3d06079bae0fbc096d6c3639807d9be3597e841a)
Ramana Raja [Wed, 13 Sep 2017 14:23:43 +0000 (19:53 +0530)]
pybind/ceph_volume_client: add get, put, and delete object interfaces
Wrap low-level rados APIs to allow ceph_volume_client to get, put, and
delete objects. The interfaces would allow OpenStack Manila's
cephfs driver to store config data in a shared storage to implement
highly available Manila deployments. Restrict write(put) and
read(get) object sizes to 'osd_max_size' config setting.
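The shape of such an interface, with the size restriction applied on both paths, might look like the sketch below. It uses an in-memory dict instead of rados, and the class name, method names and max_size parameter are assumptions rather than the real ceph_volume_client API.

```python
class ObjectStoreSketch:
    """In-memory stand-in for a pool; not the real ceph_volume_client."""

    def __init__(self, max_size):
        self.max_size = max_size  # stands in for the size limit setting
        self._objects = {}

    def put_object(self, name, data):
        # Refuse writes larger than the configured limit.
        if len(data) > self.max_size:
            raise ValueError("data larger than max_size")
        self._objects[name] = bytes(data)

    def get_object(self, name):
        # Read at most max_size bytes; raises KeyError if missing.
        return self._objects[name][:self.max_size]

    def delete_object(self, name):
        # Deleting a missing object is treated as a no-op here.
        self._objects.pop(name, None)
```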
... class attribute of the 'CephFSVolumeClient' class. It was supposed
to record the earliest version of CephFSVolumeClient that the current
version is compatible with. It's not useful data to be stored as a
class attribute.
mon/MgrMonitor: populate on-disk cmd descs if empty on upgrade
During kraken, when we first introduced the mgrs, we wouldn't populate
the on-disk command descriptions on create_initial(). Therefore, if we
are upgrading from a cluster that never had a mgr, we may end up
crashing because we have no cmd descs to load from disk.
Fixes: http://tracker.ceph.com/issues/21300
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Sage Weil [Thu, 10 Aug 2017 20:44:59 +0000 (16:44 -0400)]
os/bluestore: allocate entire write in one go
On the first pass through the writes, compress data and calculate a final
amount of space we need to allocate. On the second pass, assign the
extents to blobs and queue the writes.
This allows us to do a single allocation for all blobs, which will lead
to less fragmentation and a much better write pattern.
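The two passes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the plan_writes name, the use of zlib, the 4096-byte allocation unit, and modelling the single allocation as one contiguous range starting at offset 0 (a real allocator may return several extents).

```python
import zlib


def plan_writes(chunks, alloc_unit=4096):
    """Return (total_bytes_needed, [(offset, length, data), ...])."""
    # Pass 1: compress each chunk and total the space it will need.
    prepared = []
    need = 0
    for raw in chunks:
        comp = zlib.compress(raw)
        data = comp if len(comp) < len(raw) else raw  # keep the smaller
        length = -(-len(data) // alloc_unit) * alloc_unit  # round up
        prepared.append((data, length))
        need += length

    # Single allocation of `need` bytes, modelled as starting at 0.
    offset = 0

    # Pass 2: carve that allocation into one extent per blob.
    extents = []
    for data, length in prepared:
        extents.append((offset, length, data))
        offset += length
    return need, extents
```

Because the total is known before anything is allocated, the blobs end up adjacent to each other instead of being scattered by separate per-blob allocations.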
Sage Weil [Tue, 19 Sep 2017 19:44:50 +0000 (14:44 -0500)]
osd/PG: set {backfill,recovery}_wait when canceling backfill/recovery
The only caller currently is when we get as far as we can with backfill
or recovery but still have unfound objects. In this case, go back into
the *_wait state instead of appearing as though we are still doing
something.
Ilya Dryomov [Thu, 17 Aug 2017 13:35:42 +0000 (15:35 +0200)]
qa/tasks/rbd.xfstests: take exclude list from yaml
Different filesystems (and further, different configurations of the
same filesystem) need different exclude lists. Hard coding the list in
a wrapper script is inflexible.
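One way to picture this: the task receives its already-parsed yaml config as a dict and renders the exclude list from it, rather than relying on a list baked into the script. The render_exclude_list helper and the "exclude" key are assumptions for illustration, not the actual task code.

```python
def render_exclude_list(config):
    """Render config['exclude'] (a list of test names) one per line.

    `config` is assumed to be the parsed yaml for the task; a missing
    key means nothing is excluded.
    """
    tests = config.get("exclude", [])
    return "".join(test + "\n" for test in tests)


# Example: two jobs for the same filesystem with different excludes.
xfs_job = {"fs_type": "xfs", "exclude": ["generic/042", "xfs/007"]}
ext4_job = {"fs_type": "ext4"}
```

Each job then carries its own list, so a different filesystem or configuration only needs a different yaml fragment.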
Ilya Dryomov [Wed, 16 Aug 2017 09:47:19 +0000 (11:47 +0200)]
qa/run_xfstests.sh: quit building xfstests on test nodes
xfstests is a pain to build on trusty, xenial and centos7 with a single
script. It is also very sensitive to dependencies, which again need to
be managed on all those distros -- different sets of supported commands
and switches, some versions have known bugs, etc.
Download a pre-built, statically linked tarball and use it instead.
The tarball was generated using xfstests-bld by Ted Ts'o, with a number
of tweaks by myself (mostly concerning the build environment).
Ilya Dryomov [Wed, 16 Aug 2017 09:47:19 +0000 (11:47 +0200)]
qa/run_xfstests.sh: drop *_MKFS_OPTIONS variables
AFAICT ./check doesn't query EXT4_MKFS_OPTIONS or BTRFS_MKFS_OPTIONS.
We don't need anything special for xfs, so remove all of them to avoid
confusion.