Yan, Zheng [Fri, 6 Jan 2017 07:42:52 +0000 (15:42 +0800)]
mds: fix null pointer dereference in Locker::handle_client_caps
Locker::handle_client_caps delays processing a cap message if the
corresponding inode is freezing or frozen. By the time the message gets
processed, the client may have already closed the session.
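A minimal sketch of the guard this calls for (illustrative Python, not
the actual C++; get_session and the logging are stand-ins):

    def handle_client_caps(mds, m):
        # The message may have been delayed while the inode was freezing;
        # by now the client may be gone, so re-resolve the session and
        # bail out instead of dereferencing a stale pointer.
        session = mds.get_session(m)  # stand-in: None if the session closed
        if session is None:
            mds.log('no session for delayed cap message, dropping')
            return
        # ... safe to use the session from here on ...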
Sage Weil [Thu, 22 Dec 2016 18:05:22 +0000 (13:05 -0500)]
qa/tasks/workunit: clear clone dir before retrying checkout
If we check out ceph-ci.git and don't find a branch, we'll try
again from ceph.git. But the checkout will already exist and the
clone will fail, so we'll still fail to find the branch.
The same can happen if a previous workunit task already
checked out the repo.
Fix by removing the repo before checkout (both the first and second
time). Note that this may break if there are multiple workunit tasks
running in parallel on the same role. That is already racy, so if it
happens, we'll want to switch to a truly unique clonedir for each
instantiation.
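A sketch of the retry-safe sequence (illustrative names, not the actual
qa/tasks/workunit code):

    import shutil
    import subprocess

    def clone_branch(repo_url, branch, clonedir):
        # Remove any stale clone first, so a retry from a different repo
        # (e.g. ceph.git after ceph-ci.git) starts from a clean directory.
        shutil.rmtree(clonedir, ignore_errors=True)
        subprocess.check_call(['git', 'clone', '--depth', '1',
                               '--branch', branch, repo_url, clonedir])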
Samuel Just [Mon, 12 Dec 2016 18:35:38 +0000 (10:35 -0800)]
PG: fix cached_removed_snaps bug in PGPool::update after map gap
5798fb3bf6d726d14a9c5cb99dc5902eba5b878a actually made 15943 worse
by always creating an out-of-date cached_removed_snaps value after
a map gap, rather than only in the case where the first map after
the gap did not remove any snapshots.
Jeff Layton [Tue, 3 Jan 2017 17:56:51 +0000 (12:56 -0500)]
client: don't use special faked-up inode for /..
The CEPH_INO_DOTDOT thing is quite strange. On most OSes (Linux
included), the parent of the root is itself. IOW, at the root, '.' and
'..' refer to the same inode.
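The invariant is easy to check on an ordinary Linux filesystem
(runnable Python, shown purely for illustration):

    import os

    st_root = os.stat('/')
    st_up = os.stat('/..')
    # At the root, '.' and '..' resolve to the same inode.
    assert (st_root.st_dev, st_root.st_ino) == (st_up.st_dev, st_up.st_ino)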
Change the ceph client to do the same, as this allows users to get
valid stat info for '..', as well as eliminating some special-casing.
Also, in several places we're checking dn_set.empty as an indicator
of being the root. While that is true for the root, it's also true
for unlinked directories.
This patch treats them the same: an unlinked directory will be
reparented to itself, effectively acting as a root of its own.
Fixes: http://tracker.ceph.com/issues/18408
Signed-off-by: Jeff Layton <jlayton@redhat.com>
(cherry picked from commit 30d4ca01db0de9a1e12658793ba9bf9faf0331dd)
Jeff Layton [Tue, 20 Dec 2016 19:44:04 +0000 (14:44 -0500)]
ceph_disk: fix a jewel checkin test break
Silly python:
ceph_disk/main.py:173:1: E305 expected 2 blank lines after class or function definition, found 1
ceph_disk/main.py:5011:1: E305 expected 2 blank lines after class or function definition, found 1
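For reference, this is the shape pycodestyle expects (an illustrative
snippet, not from ceph_disk/main.py):

    def activate(path):
        return path


    # Two blank lines above this statement satisfy E305; the single
    # blank line flagged above does not.
    DEFAULT = activate('/var/lib/ceph')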
Jeff Layton [Tue, 20 Dec 2016 13:17:21 +0000 (08:17 -0500)]
client: drop setuid/setgid bits on ownership change
When we hold exclusive auth caps, the client is responsible for
handling changes to the mode. Make sure we remove any setuid/setgid
bits on an ownership change.
Jeff Layton [Tue, 20 Dec 2016 13:16:43 +0000 (08:16 -0500)]
mds: clear setuid/setgid bits on ownership changes
If we get an ownership change, POSIX mandates that you clear the
setuid and setgid bits unless you are "appropriately privileged", in
which case the OS is allowed to leave them intact.
Linux, however, always clears those bits regardless of the process's
privileges, as that makes it simpler to close some potential races.
Have ceph do the same.
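A sketch of the resulting mode computation (illustrative Python, not
the MDS code; the group-execute test mirrors Linux, where setgid
without group-exec denotes mandatory locking and is left alone):

    import stat

    def mode_after_chown(mode):
        mode &= ~stat.S_ISUID        # setuid is always dropped
        if mode & stat.S_IXGRP:      # setgid is dropped only on
            mode &= ~stat.S_ISGID    # group-executable files
        return mode

    assert mode_after_chown(0o6755) == 0o0755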
Jeff Layton [Tue, 20 Dec 2016 13:07:23 +0000 (08:07 -0500)]
client: set metadata["root"] from mount method when it's called with a pathname
Currently, we only set the root properly via the config file or the
--client_metadata command line option. If a userland client program
tries to call ceph_mount with a pathname, it's not being properly
set.
Since we already hold the mutex, we can just update it directly.
Sage Weil [Wed, 14 Dec 2016 17:18:29 +0000 (12:18 -0500)]
tasks/workunit: remove kludge to use git.ceph.com
This was hard-coded (almost) to ceph.git and breaks when you
specify --ceph-repo. Remove it entirely. We'll see if github.com is
better at handling our load than it used to be!
Sage Weil [Thu, 8 Dec 2016 00:25:55 +0000 (18:25 -0600)]
msg/simple/Pipe: avoid returning 0 on poll timeout
If poll times out it will return 0 (no data to read on socket). In 165e5abdbf6311974d4001e43982b83d06f9e0cc we changed tcp_read_wait from
returning -1 to returning -errno, which means we return 0 instead of -1
in this case.
This makes tcp_read() get into an infinite loop by repeatedly trying to
read from the socket and getting EAGAIN.
Fix by explicitly checking for a 0 return from poll(2) and returning
EAGAIN in that case.
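In Python terms, the fixed logic looks roughly like this (a sketch,
not the actual code in msg/simple/Pipe):

    import errno
    import select

    def tcp_read_wait(sock, timeout_ms):
        poller = select.poll()
        poller.register(sock, select.POLLIN)
        events = poller.poll(timeout_ms)
        if not events:
            # poll(2) timed out: surface EAGAIN explicitly rather than
            # letting a 0 return masquerade as 'socket is readable'.
            return -errno.EAGAIN
        return 0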
Loic Dachary [Wed, 30 Nov 2016 23:28:32 +0000 (00:28 +0100)]
ceph-disk: enable --runtime ceph-osd systemd units
If ceph-osd@.service is enabled for a given device (say /dev/sdb1 for
osd.3) the ceph-osd@3.service will race with ceph-disk@dev-sdb1.service
at boot time.
Enabling ceph-osd@3.service is not necessary at boot time because
    ceph-disk@dev-sdb1.service
calls
    ceph-disk activate /dev/sdb1
which calls
    systemctl start ceph-osd@3
The systemctl enable/disable ceph-osd@.service called by ceph-disk
activate is changed to add the --runtime option so that ceph-osd units
are lost after a reboot. They are recreated when ceph-disk activate is
called at boot time so that:
    systemctl stop ceph
knows which ceph-osd@.service to stop when a script or sysadmin wants
to stop all ceph services.
Before enabling ceph-osd@.service (which happens at every boot),
make sure the permanent enablement in /etc/systemd is removed so that
only the one added by systemctl enable --runtime in /run/systemd
remains. This is useful when upgrading an existing cluster, to avoid
creating a situation that is even worse than before, with
ceph-disk@.service racing against two ceph-osd@.service units (one in
/etc/systemd and one in /run/systemd).
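Roughly the sequence ceph-disk activate now performs (the systemctl
invocations are real syntax; the surrounding Python is illustrative):

    import subprocess

    def enable_runtime_unit(osd_id):
        unit = 'ceph-osd@{}'.format(osd_id)
        # Drop any permanent enablement left under /etc/systemd by a
        # previous version...
        subprocess.check_call(['systemctl', 'disable', unit])
        # ...then enable for this boot only, under /run/systemd.
        subprocess.check_call(['systemctl', 'enable', '--runtime', unit])
        subprocess.check_call(['systemctl', 'start', unit])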
Loic Dachary [Wed, 30 Nov 2016 16:33:54 +0000 (17:33 +0100)]
build/ops: restart ceph-osd@.service after 20s instead of 100ms
Instead of the default 100ms pause before trying to restart an OSD,
wait 20 seconds, and retry 30 times instead of 3. There is no scenario
in which restarting an OSD almost immediately after it failed would
give a better result.
It is possible that a failure to start is due to a race with another
systemd unit at boot time. For instance if ceph-disk@.service is
delayed, it may start after the OSD that needs it. A long pause may give
the racing service enough time to complete and the next attempt to start
the OSD may succeed.
This is not a sound way to resolve a race; it only makes the OSD
boot process less sensitive to one. In the example above, the proper
fix is to enable ceph-osd@.service with --runtime so that it cannot
race at boot time.
The wait delay should not be on the order of minutes, to preserve the
current runtime behavior: if an OSD is killed or fails and only
restarts after 10 minutes, it will be marked down by the ceph cluster.
That would not break things, but it is a significant behavior change
and should be avoided.
Loic Dachary [Tue, 22 Nov 2016 14:26:18 +0000 (15:26 +0100)]
ceph-disk: trigger must ensure device ownership
The udev rules that set the owner/group of the OSD devices are racing
with 50-udev-default.rules and depending on which udev event fires last,
ownership may not be as expected.
Since ceph-disk trigger --sync runs as root and always happens after
the dm/lvm/filesystem units are complete and before activation, it is
a good time to set the ownership of the device.
It does not eliminate all races: a script running after systemd
local-fs.target and firing a udev event may create a situation where the
permissions of the device are temporarily reverted while the activation
is running.
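A sketch of the ownership step (illustrative; the real ceph-disk code
resolves the configured user and group):

    import grp
    import os
    import pwd

    def own_device(dev, user='ceph', group='ceph'):
        # Runs as root during 'ceph-disk trigger --sync', after the
        # dm/lvm/filesystem units are complete and before activation.
        uid = pwd.getpwnam(user).pw_uid
        gid = grp.getgrnam(group).gr_gid
        os.chown(dev, uid, gid)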
Loic Dachary [Tue, 22 Nov 2016 13:45:45 +0000 (14:45 +0100)]
ceph-disk: systemd unit must run after local-fs.target
A ceph udev action may be triggered before the local file systems are
mounted because there is no ordering in udev. The ceph udev action
delegates asynchronously to systemd via ceph-disk@.service which will
fail if (for instance) the LVM partition required to mount /var/lib/ceph
is not available yet. The systemd unit will retry a few times but will
eventually fail permanently. The sysadmin can run systemctl
reset-failed at a later time and it will succeed.
Add a dependency to ceph-disk@.service so that it waits until the local
file systems are mounted:
    After=local-fs.target
Since local-fs.target depends on lvm, it will wait until the lvm
partition (as well as any dm devices) is ready and mounted before
attempting to activate the OSD. It may still fail because the
corresponding journal/data partition is not ready yet (which is
expected) but it will no longer fail because the lvm/filesystems/dm are
not ready.
Nathan Cutler [Thu, 24 Nov 2016 10:25:35 +0000 (11:25 +0100)]
thrashosds: try ceph-objectstore-tool for 10 minutes
If the ceph-objectstore-tool binary is not present, it's likely
because we're in the middle of an upgrade. Do not try to run the
binary until we verify that it's really present; if it is absent,
spend up to 10 minutes waiting for it to appear.
Before this patch there was quite a large window for a race to occur. This
patch doesn't entirely eliminate it, but drastically reduces it.
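A sketch of the wait loop (illustrative, not the actual thrashosds
code):

    import shutil
    import time

    def wait_for_tool(name='ceph-objectstore-tool', timeout=600, interval=30):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if shutil.which(name):   # present: safe to run it now
                return True
            time.sleep(interval)     # mid-upgrade: not installed yet
        return False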
Nathan Cutler [Sat, 3 Dec 2016 12:29:56 +0000 (13:29 +0100)]
build/ops: fix undefined crypto references with --with-xio
Only when building with --with-xio, the RPM build fails due to
undefined references to various symbols starting with "PK11_" in
./.libs/libcommon.a(Crypto.o) in several of the unit tests.
Samuel Just [Mon, 14 Nov 2016 19:50:23 +0000 (11:50 -0800)]
OSDMonitor: only reject MOSDBoot based on up_from if inst matches
If the osd actually restarts, there is no guarantee that the epoch will
advance past up_from. If the inst is different, it can't really be a
dup. At worst, it might be a queued MOSDBoot from a previous inst, but
in that case, the real inst would see itself marked up, and then back
down, causing it to try booting again.
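The dup test this implies, sketched in Python (illustrative; inst_of
and up_from are stand-ins for the real OSDMonitor/OSDMap accessors):

    def is_dup_boot(msg, osdmap):
        # Only treat the boot as a dup when the instance matches what
        # the map already records; a different inst cannot be a dup,
        # even if the epoch has not advanced past up_from.
        return (osdmap.inst_of(msg.osd) == msg.inst
                and msg.epoch <= osdmap.up_from(msg.osd))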
"ceph-disk trigger" invocation is currently performed in a mutually
exclusive fashion, with each call first taking an flock on the path
/var/lock/ceph-disk. On systems with a lot of osds, this leads to a
large amount of lock contention during boot-up, and can cause some
service instances to trip the 120 second timeout.
Take an flock on a device specific path instead of /var/lock/ceph-disk,
so that concurrent "ceph-disk trigger" invocations are permitted for
independent osds. This greatly reduces lock contention and consequently
the chance of service timeout. Per-device concurrency restrictions
required for http://tracker.ceph.com/issues/13160 are maintained.
Fixes: http://tracker.ceph.com/issues/18060
Signed-off-by: David Disseldorp <ddiss@suse.de>
(cherry picked from commit 8a62cbc074b711275cfd57b372bfb35f6a017833)
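A sketch of the per-device lock (the lock-path naming is illustrative):

    import fcntl
    import os

    def lock_for_device(dev):
        # e.g. /var/lock/ceph-disk-sdb1 instead of the single global
        # /var/lock/ceph-disk, so triggers for independent devices run
        # concurrently while same-device triggers still serialize.
        path = '/var/lock/ceph-disk-' + os.path.basename(dev)
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        fcntl.flock(fd, fcntl.LOCK_EX)  # held until fd is closed
        return fd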
Loic Dachary [Tue, 29 Nov 2016 08:49:15 +0000 (09:49 +0100)]
upgrade/hammer-x: verify shec before the full upgrade
The hammer-x/stress-split-erasure-code upgrade sequence comes from
hammer-x/stress-split and was modified to fully upgrade the cluster. It
previously upgraded only half of it. Verifying that the shec plugin is
not available, and that trying to set it does not crash the OSD or the
MON, must be done before the upgrade is complete.
[Updated write_file to use all features]
[Updated OSDMonitor.cc to use mon->quorum_features instead of the
mon->get_quorum_con_features() helper]
[trivial conflict from removed write_file and read_file]
Sage Weil [Mon, 28 Nov 2016 15:29:40 +0000 (10:29 -0500)]
upgrade/hammer-x/parallel: whitelist 'failed to encode'
The problem here has nothing to do with osdmap encoding; rather,
hammer -> jewel makes the systemd transition, and installing the
package starts the mons before the osds. I'm not sure what the
workaround for that is, but the osdmap issue appears okay, so ignore
this for now.