Loic Dachary [Wed, 30 Nov 2016 23:28:32 +0000 (00:28 +0100)]
ceph-disk: enable --runtime ceph-osd systemd units
If ceph-osd@.service is enabled for a given device (say /dev/sdb1 for
osd.3), ceph-osd@3.service will race with ceph-disk@dev-sdb1.service
at boot time.
Enabling ceph-osd@3.service is not necessary at boot time, because
ceph-disk@dev-sdb1.service calls

    ceph-disk activate /dev/sdb1

which in turn calls

    systemctl start ceph-osd@3
The systemctl enable/disable of ceph-osd@.service performed by ceph-disk
activate is changed to add the --runtime option, so that ceph-osd units
do not persist across reboots. They are recreated when ceph-disk activate
runs at boot time, so that

    systemctl stop ceph

knows which ceph-osd@.service units to stop when a script or sysadmin wants
to stop all Ceph services.
Before enabling ceph-osd@.service (which happens at every boot), make
sure any permanent enablement in /etc/systemd is removed, so that only
the unit added by systemctl enable --runtime in /run/systemd remains.
This matters when upgrading an existing cluster: otherwise the situation
would be even worse than before, with ceph-disk@.service racing against
two ceph-osd@.service units (one in /etc/systemd and one in
/run/systemd).
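As a rough sketch of that sequence in ceph-disk's Python (the helper name
and argument handling below are illustrative assumptions, not the exact
patch):

    import subprocess

    def enable_osd_unit_runtime(osd_id):
        unit = 'ceph-osd@{}'.format(osd_id)
        # Drop any permanent enablement in /etc/systemd first, so that
        # ceph-disk@.service cannot race against two copies of the unit.
        subprocess.call(['systemctl', 'disable', unit])
        # Enable for the current boot only; the symlink lands in
        # /run/systemd and disappears at reboot.
        subprocess.check_call(['systemctl', 'enable', '--runtime', unit])
        subprocess.check_call(['systemctl', 'start', unit])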
Loic Dachary [Wed, 30 Nov 2016 16:33:54 +0000 (17:33 +0100)]
build/ops: restart ceph-osd@.service after 20s instead of 100ms
Instead of the default 100ms pause before trying to restart an OSD, wait
20 seconds, and retry 30 times instead of 3. There is no scenario in
which restarting an OSD almost immediately after it fails would get
a better result.
It is possible that a failure to start is due to a race with another
systemd unit at boot time. For instance if ceph-disk@.service is
delayed, it may start after the OSD that needs it. A long pause may give
the racing service enough time to complete and the next attempt to start
the OSD may succeed.
This is not a sound way to resolve a race; it only makes the OSD
boot process less sensitive to one. In the example above, the proper fix
is to enable ceph-osd@.service with --runtime so that it cannot race at
boot time.
The wait delay should not be on the order of minutes, to preserve the
current runtime behavior: if an OSD is killed or fails and only restarts
after 10 minutes, it will be marked down by the Ceph cluster. That is not
a change that would break things, but it is significant and should be
avoided.
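For reference, the behavior described above corresponds to systemd's
RestartSec and StartLimitBurst settings. As a minimal sketch (the drop-in
path below is just one way to apply them; the actual commit edits the unit
file shipped in the package):

    import os

    # Values taken from the commit message: a 20s pause between restart
    # attempts, and up to 30 attempts instead of 3.
    OVERRIDE = '[Service]\nRestartSec=20s\nStartLimitBurst=30\n'

    def install_restart_override():
        d = '/etc/systemd/system/ceph-osd@.service.d'
        if not os.path.isdir(d):
            os.makedirs(d)
        with open(os.path.join(d, 'restart.conf'), 'w') as f:
            f.write(OVERRIDE)
        # A subsequent `systemctl daemon-reload` is needed to apply it.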
Loic Dachary [Tue, 22 Nov 2016 14:26:18 +0000 (15:26 +0100)]
ceph-disk: trigger must ensure device ownership
The udev rules that set the owner/group of OSD devices race with
50-udev-default.rules, and depending on which udev event fires last,
ownership may not be as expected.
Since ceph-disk trigger --sync runs as root, always happens after the
dm/lvm/filesystem units are complete, and runs before activation, it is
a good time to set the ownership of the device.
It does not eliminate all races: a script running after systemd
local-fs.target and firing a udev event may create a situation where the
permissions of the device are temporarily reverted while the activation
is running.
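As a minimal Python sketch of what setting the ownership at that point
can look like (the helper name is an assumption; the real change lives
in ceph-disk's main.py):

    import os
    import pwd
    import grp

    def ensure_ceph_ownership(dev):
        # Runs as root from the trigger path, after the dm/lvm/filesystem
        # units have settled, so 50-udev-default.rules will not
        # immediately overwrite the ownership set here.
        uid = pwd.getpwnam('ceph').pw_uid
        gid = grp.getgrnam('ceph').gr_gid
        os.chown(dev, uid, gid)

    # e.g. ensure_ceph_ownership('/dev/sdb1')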
Loic Dachary [Tue, 22 Nov 2016 13:45:45 +0000 (14:45 +0100)]
ceph-disk: systemd unit must run after local-fs.target
A Ceph udev action may be triggered before the local file systems are
mounted, because udev provides no ordering. The udev action delegates
asynchronously to systemd via ceph-disk@.service, which will fail if
(for instance) the LVM partition required to mount /var/lib/ceph is not
available yet. The systemd unit will retry a few times but will
eventually fail permanently. The sysadmin can run systemctl reset-failed
at a later time and the unit will then succeed.
Add a dependency to ceph-disk@.service so that it waits until the local
file systems are mounted:

    After=local-fs.target
Since local-fs.target depends on LVM, the unit will wait until the LVM
partition (as well as any dm devices) is ready and mounted before
attempting to activate the OSD. It may still fail because the
corresponding journal/data partition is not ready yet (which is
expected), but it will no longer fail because the lvm/filesystem/dm
layers are not ready.
Ken Dreyer [Mon, 14 Nov 2016 21:49:15 +0000 (14:49 -0700)]
ceph-disk: fix flake8 errors
flake8 3.1.1 surfaces the following issues:

    ceph_disk/main.py:173:1: E305 expected 2 blank lines after class or function definition, found 1
    ceph_disk/main.py:5011:1: E305 expected 2 blank lines after class or function definition, found 1
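E305 fires when a top-level statement follows a class or function
definition with fewer than two blank lines. A minimal illustration (not
the actual code from main.py):

    def helper():
        return 42

    value = helper()  # E305: expected 2 blank lines, found 1


    def helper_ok():
        return 43


    value_ok = helper_ok()  # OK: two blank lines after the definition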
Mykola Golub [Thu, 10 Nov 2016 13:35:59 +0000 (15:35 +0200)]
librbd: restore journal access when force disabling mirroring
If mirroring was force-disabled on a demoted image, the journal was
left in an inconsistent ownership state.
This is a direct commit to jewel, as the fix in master was
against the newly added async version of mirror disable, which is
not going to be merged to jewel.
Casey Bodley [Wed, 9 Nov 2016 19:27:11 +0000 (14:27 -0500)]
rgw: add missing mutex header for std::once_flag
This fix was added directly to the jewel branch rather than backported
from master, because the code on master compiles without this specific
include; it is likely pulled in by another header there, and backporting
would involve pulling in unrelated changes.
Casey Bodley [Fri, 17 Jun 2016 02:51:54 +0000 (22:51 -0400)]
rgw: add pipe fd to set for select() in do_curl_wait()
When HAVE_CURL_MULTI_WAIT is 0, the pipe fd is never added to the
readfds for select(), so FD_ISSET() is always false. This prevents us
from ever trying to read from the fd, and the pipe's buffer eventually
fills up, deadlocking callers of RGWHTTPManager::signal_thread() when
they try to write to the pipe.
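The failure mode is easy to reproduce in any language. A minimal Python
sketch of the corrected wait loop (names are illustrative; the actual
code is C++ in do_curl_wait()):

    import os
    import select

    def wait_loop(curl_fds, pipe_rfd):
        # The pipe's read end must be in readfds; otherwise select()
        # never reports it ready, nothing drains it, and writers block
        # forever once the pipe buffer fills up.
        ready, _, _ = select.select(list(curl_fds) + [pipe_rfd], [], [])
        if pipe_rfd in ready:
            os.read(pipe_rfd, 256)  # drain signal bytes so writers never block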
Brad Hubbard [Fri, 7 Oct 2016 04:51:41 +0000 (14:51 +1000)]
common: Remove the runtime dependency on lsb_release
With modern releases we should be able to make do with the call to
os_release_parse alone, which uses /etc/os-release; that file should be
available on most (all?) releases we currently support. This allows us
to remove the runtime dependency on lsb_release, which pulls in several
other packages and is nice to avoid.
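For illustration, /etc/os-release is simple key=value text; a minimal
Python sketch of parsing it (Ceph's actual os_release_parse is C++ in
the common code):

    def parse_os_release(path='/etc/os-release'):
        # Returns e.g. {'ID': 'centos', 'VERSION_ID': '7', ...}
        info = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                info[key] = value.strip('"')
        return info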
This allows the MDS to respawn using the same executable file even if it
has since been deleted (on Linux). Otherwise, the execv fails because
the readlink returns "/path/to/deleted (deleted)". (There is no path to
the old executable.)
Fixes: http://tracker.ceph.com/issues/17531
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 66a122025f6cf023cf7b2f3d8fbe4964fb7568a7)
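A minimal Python sketch of the Linux behavior described above
(illustrative only; the MDS respawn code is C++, and the argv shown is
made up):

    import os

    exe = os.readlink('/proc/self/exe')
    # If the binary was replaced or removed since startup, the kernel
    # reports e.g. '/usr/bin/ceph-mds (deleted)', which cannot be exec'd.
    # Exec'ing '/proc/self/exe' itself sidesteps the stale path, because
    # the kernel still holds the original inode open.
    if exe.endswith(' (deleted)'):
        os.execv('/proc/self/exe', ['ceph-mds'])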
Zengran Zhang [Wed, 19 Oct 2016 09:05:27 +0000 (17:05 +0800)]
rgw multisite: fix the incremental bucket sync init
In `RGWBucketShardFullSyncCR::operate`, inc_marker is assigned the remote
bilog's max_marker, but the sync_status's inc_marker was never assigned.
As a result, the next incremental sync step would always sync from a null
log position, which means from the beginning of the log.
rgw: get_zonegroup() uses "default" zonegroup if empty
Fixes: http://tracker.ceph.com/issues/17372
An empty zonegroup should be replaced with the "default" zonegroup.
This is needed when dealing with a zonegroup set in old bucket info
that predates setting the bucket's region.
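The fallback itself is tiny; a hedged Python sketch of the idea (the
real change is in RGW's C++ get_zonegroup()):

    def effective_zonegroup(bucket_zonegroup):
        # Old bucket info may carry an empty zonegroup; use 'default'.
        return bucket_zonegroup or 'default'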
Patrick Donnelly [Mon, 10 Oct 2016 22:16:16 +0000 (18:16 -0400)]
mds: use parse_filesystem in parse_role
This allows us to reuse the code in parse_filesystem and avoid
get_filesystem, which may throw if the fscid does not exist; that would
cause the program (the mon) to abort due to the uncaught exception.
Fixes: http://tracker.ceph.com/issues/17518
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit edc78e46cee356da1e45247c38b7428dd6c965cb)
Yan, Zheng [Sat, 8 Oct 2016 07:16:40 +0000 (15:16 +0800)]
mds: fix false "failing to respond to cache pressure" warning
The false warning happens in the following sequence of events:
- The MDS has cache pressure and sends recall state messages to clients.
- A client does not trim as many caps as the MDS expected, so the MDS
does not reset session->recalled_at.
- The MDS no longer has cache pressure, so it stops sending recall state
messages to clients.
- The client does not release its caps, so session->recalled_at in the
MDS remains unchanged.
Ken Dreyer [Fri, 23 Sep 2016 20:49:56 +0000 (14:49 -0600)]
rpm: fix permissions for /etc/ceph/rbdmap
Prior to this change, the RPM packaging would install /etc/ceph/rbdmap
with executable permissions. The execute bit is not necessary and does
not match what the Debian packaging does. Remove the execute bit.
Fixes: http://tracker.ceph.com/issues/17395
Reported-by: Martin Bukatovic <mbukatov@redhat.com>
Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
(cherry picked from commit d4b84a13960f9f46593bf89dc92bfc3e54b4851e)
John Spray [Mon, 19 Sep 2016 14:22:01 +0000 (15:22 +0100)]
mds: use a random nonce in Messenger
The MDS is a client to the OSDs, and responds to blacklists by
respawning itself. Usually respawns of a daemonized process result in a
PID change, but that is not guaranteed, and it is definitely not the
case when someone runs it in the foreground (e.g. teuthology).
Using a random nonce makes sure we won't match against an existing
blacklist entry from a failed instance of an MDS daemon with the same
name as us.
Related to: http://tracker.ceph.com/issues/17236
Signed-off-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 5ba612882750dae6f0b057c660cd283293a18a3f)
Conflicts:
src/ceph_mds.cc : Messenger::create() prototype is different
Sage Weil [Fri, 21 Oct 2016 16:25:08 +0000 (12:25 -0400)]
mon/OSDMonitor: encode OSDMap::Incremental with same features as OSDMap
The Incremental encode stashes encode_features, which is
what we use later to reencode the updated OSDMap. Use
the same features so that the encoding will match!
Conflicts:
src/mon/OSDMonitor.cc: remove references to kraken
    if ((osdmap.get_up_osd_features() & CEPH_FEATURE_SERVER_KRAKEN) &&
        !osdmap.test_flag(CEPH_OSDMAP_REQUIRE_KRAKEN)) {
      string msg = "all OSDs are running kraken or later but the"
        " 'require_kraken_osds' osdmap flag is not set";
      summary.push_back(make_pair(HEALTH_WARN, msg));
      if (detail) {
        detail->push_back(make_pair(HEALTH_WARN, msg));
      }
    } else
Sage Weil [Fri, 30 Sep 2016 22:02:39 +0000 (18:02 -0400)]
mon/OSDMonitor: encode canonical full osdmap based on osdmap flags
If the JEWEL or KRAKEN flags aren't set, encode the full map without
those features. This ensures that older OSDs in the cluster will be able
to correctly encode the full map with a matching CRC. At least, that is
true as long as the encoding changes are guarded by those feature bits.
That appears to be true currently, and we plan to ensure that it is true
in the future as well.
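A hedged sketch of the idea (Python pseudocode with made-up bit values;
the actual logic and constants live in the C++ OSDMonitor and feature
headers):

    # Hypothetical feature/flag bits, for illustration only.
    FEATURE_SERVER_JEWEL = 1 << 0
    FEATURE_SERVER_KRAKEN = 1 << 1
    FLAG_REQUIRE_JEWEL = 1 << 0
    FLAG_REQUIRE_KRAKEN = 1 << 1

    def full_map_encode_features(all_features, osdmap_flags):
        # Drop feature bits whose osdmap flag is unset, so that older
        # OSDs can re-encode an identical full map and the CRC matches.
        features = all_features
        if not (osdmap_flags & FLAG_REQUIRE_JEWEL):
            features &= ~FEATURE_SERVER_JEWEL
        if not (osdmap_flags & FLAG_REQUIRE_KRAKEN):
            features &= ~FEATURE_SERVER_KRAKEN
        return features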
rgw: only enable virtual hosting if hostnames are configured
If no hostnames are configured, all requests were treated as virtual-hosted
buckets. Require at least one hostname in hostnames_set before
considering setting in_hosted_domain.
Robin H. Johnson [Thu, 25 Aug 2016 15:04:34 +0000 (08:04 -0700)]
rgw: Fix Host->bucket fallback logic inversion
The logic (added in 46aae19ee) for falling back to just using the hostname as
the possible bucket name contained an accidental inversion, because
RGWHandler_REST::validate_bucket_name returns success as zero.
Backport: jewel
Fixes: http://tracker.ceph.com/issues/17136
Re-Fixes: http://tracker.ceph.com/issues/15975
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
(cherry picked from commit 70e0289644f4a7205e6c2f75a094ece8ab5ed97c)
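This class of bug is worth a two-line illustration; a hedged Python
sketch (validate_bucket_name here is a stand-in for the C++ method,
which returns zero on success):

    def validate_bucket_name(name):
        # Stand-in: 0 on success, a negative error code otherwise.
        return 0 if name and len(name) <= 63 else -22

    host = 'bucket.example.com'
    if validate_bucket_name(host):       # buggy: 0 (success) is falsy,
        pass                             # so valid names are skipped
    if validate_bucket_name(host) == 0:  # fixed: explicit success check
        pass                             # hostname may serve as bucket name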
David Galloway [Fri, 19 Aug 2016 20:11:32 +0000 (16:11 -0400)]
ceph-post-file: Ignore keys offered by ssh-agent
In my case, I had multiple private keys in ssh-agent, which resulted in
the sftp connection failing despite explicitly specifying the private
key to use.
Sage Weil [Wed, 2 Nov 2016 13:37:41 +0000 (09:37 -0400)]
ceph-post-file: migrate to RSA SSH keys
DSA keys are being deprecated: http://www.openssh.com/legacy.html
drop.ceph.com will continue to allow the old DSA key, but eventually
users submitting logs with ceph-post-file will run into issues when
OpenSSH completely drops support for the algorithm.
Fixes: http://tracker.ceph.com/issues/14267
Signed-off-by: David Galloway <dgallowa@redhat.com>
(cherry picked from commit ecd02bf3f1c7a07a3271b2736a9e12dd6e897821)
Sage Weil [Sun, 23 Oct 2016 23:40:57 +0000 (18:40 -0500)]
msg: adjust byte_throttler from Message::encode
Normally we never call encode on a message that has a byte_throttler set,
because we only use it for messages we received. However, for forwarded
messages that we clear_payload() before resending, we *do* reencode, and
in that case we need to retake the appropriate number of bytes from the
throttler, just as we release them in clear_payload().
Sage Weil [Sat, 22 Oct 2016 18:01:34 +0000 (14:01 -0400)]
messages/MForward: reencode forwarded message if target has differing features
This ensures we reencode the payload with the appropriate set of
features if the client, ourselves, and the target do not all have
identical features. Otherwise we may forward an encoding with more
features than the target can handle.
Sage Weil [Wed, 28 Sep 2016 15:44:28 +0000 (11:44 -0400)]
messages/MForward: fix encoding features
We were encoding the message with the sending client's features, which
makes no sense: we need to encode with the recipient's features so that
it can decode the message.

The simplest way to fix this is to rip out the bizarre msg_bl handling
code and simply keep a decoded Message reference, and encode it when we
send.

We encode the encapsulated message with the intersection of the target
mon's features and the sending client's features. This probably doesn't
matter, but it's conceivable that there is some feature-dependent
behavior in the message encode/decode that is important.
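A hedged one-function sketch of that feature selection (Python
pseudocode; the real code is C++ in MForward):

    def forward_encode_features(client_features, target_mon_features):
        # Encode the encapsulated message with only the feature bits both
        # ends share, so the target mon is never handed an encoding that
        # uses features it does not understand.
        return client_features & target_mon_features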