Sage Weil [Wed, 23 Sep 2015 14:25:30 +0000 (10:25 -0400)]
osd: do full check in do_op
1. The current pool_last_map_marked_full tracking is buggy.
2. We need to recheck this each time we consider the op, not just when it
is received off the wire. Otherwise, we might get a message, queue it
for some reason, get a map indicating the cluster or pool is full, and
then requeue and process the op instead of discarding it.
3. For now, silently drop ops when the failsafe check fails. This will lead to
stalled client IO and needs a more robust fix; a sketch of the intended
re-check follows.
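A minimal Python sketch of that re-check, with hypothetical names (the real
code is the C++ do_op path; this only models the decision):
    def handle_op(op, get_current_osdmap):
        # Consult the *latest* map every time the op is (re)considered,
        # not only when it first arrived off the wire.
        osdmap = get_current_osdmap()
        if osdmap['cluster_full'] or op['pool'] in osdmap['full_pools']:
            if not op.get('full_force'):
                return 'drop'    # for now: silently dropped, client IO stalls
        return 'process'
    # The pool goes full while the op sits in a queue; the re-check drops it.
    osdmap = {'cluster_full': False, 'full_pools': set()}
    op = {'pool': 'rbd', 'full_force': False}
    osdmap['full_pools'].add('rbd')
    print(handle_op(op, lambda: osdmap))   # -> drop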
Sage Weil [Thu, 24 Sep 2015 23:02:21 +0000 (19:02 -0400)]
osdc/Objecter: set FULL_FORCE flag when honor_full is false
This currently only applies to the MDS. Eventually we can remove the
OSD MDS checks once we are confident all MDS instances are new enough
to set this flag.
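The change amounts to setting the flag when honor_full is false; a toy
Python sketch with a placeholder bit (the real constant lives in the Ceph
headers):
    FULL_FORCE = 1 << 20   # placeholder bit for the flag named above
    def op_flags(flags, honor_full):
        # Clients that do not honor full (currently the MDS) force the op through.
        if not honor_full:
            flags |= FULL_FORCE
        return flags
    print(hex(op_flags(0, honor_full=False)))   # -> 0x100000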
Sage Weil [Thu, 24 Sep 2015 15:38:41 +0000 (11:38 -0400)]
osd/PG: compensate for sloppy hobject scrub bounds from hammer
Hammer is sloppy about the hobject_t's it uses for the scrub bounds in that
the pool isn't set. (Hammer FileStore doesn't care, but post-hammer is
much more careful about this sort of thing.)
Compensate by setting the pool on any scrub messages we receive.
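A hedged Python stand-in for that compensation (the real type is hobject_t
and the field names differ; this only illustrates the idea):
    def fix_scrub_bounds(scrub_begin, scrub_end, pg_pool):
        # Hammer peers may leave the pool unset on the scrub bounds;
        # fill in this PG's pool so post-hammer comparisons line up.
        for hobj in (scrub_begin, scrub_end):
            if hobj.get('pool') is None:
                hobj['pool'] = pg_pool
        return scrub_begin, scrub_end
    begin, end = fix_scrub_bounds({'oid': 'a'}, {'oid': 'z'}, pg_pool=3)
    print(begin['pool'], end['pool'])   # -> 3 3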
build/ops: make dist needs files with names > 99 characters
When running make distdir=ceph-9.0.3-1870-gfd861bb dist, a few files
have names longer than 99 characters and are discarded, which then causes
the resulting tarball to be incomplete:
tar: ceph-9.0.3-1870-gfd861bb/src/rocksdb/utilities/write_batch_with_index/write_batch_with_index_internal.cc: file name is too long (max 99); not dumped
tar: ceph-9.0.3-1870-gfd861bb/src/rocksdb/utilities/write_batch_with_index/write_batch_with_index_internal.h: file name is too long (max 99); not dumped
Use the tar-ustar format instead of the legacy v7
format (http://www.gnu.org/software/automake/manual/automake.html#Options). It
is unlikely that machines with a C++11 compiler also have an antique tar
binary that only supports the v7 format.
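Not the fix itself (that is an automake option), but a quick way to spot the
offending paths is a short Python walk, assuming it is run from the directory
that contains the distdir:
    import os
    DISTDIR = 'ceph-9.0.3-1870-gfd861bb'   # member names carry this prefix
    V7_NAME_LIMIT = 99                     # legacy v7 tar limit on member names
    for root, _, files in os.walk(DISTDIR):
        for name in files:
            member = os.path.join(root, name)
            if len(member) > V7_NAME_LIMIT:
                print('too long for v7 tar:', member)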
Sage Weil [Wed, 23 Sep 2015 14:58:01 +0000 (10:58 -0400)]
mon/Elector: do a trivial write on every election cycle
Currently we already do a small write when the *first* election in
a round happens (to update the election epoch). If the backend
happens to fail while we are already in the midst of elections,
however, we may continue to call elections without verifying we
are still writeable.
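A rough Python sketch of the idea, with a hypothetical store API standing in
for the monitor backend:
    class MemStore(object):
        def __init__(self):
            self.kv = {}
        def put(self, key, value):
            self.kv[key] = value
    def start_election(store, epoch):
        # Probe writeability on *every* election cycle, not only when the
        # election epoch is first bumped.
        try:
            store.put('election_writeable', str(epoch))   # trivial write
        except IOError:
            raise SystemExit('mon store no longer writeable; abort election')
        # ... propose ourselves, collect votes, etc.
    start_election(MemStore(), epoch=42)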
Sage Weil [Wed, 23 Sep 2015 12:20:39 +0000 (08:20 -0400)]
arch/arm: s/false/0/
arch/arm.c: In function 'ceph_arch_arm_probe':
arch/arm.c:54:28: error: 'false' undeclared (first use in this function)
ceph_arch_aarch64_crc32 = false; // sorry!
^
arch/arm.c:54:28: note: each undeclared identifier is reported only once for each function it appears in
1. Assign the key type from context when it isn't specified explicitly
(see the sketch below):
in a user operation context, default to KEY_TYPE_S3;
in a subuser operation context, default to KEY_TYPE_SWIFT;
in a key operation context, derive the key type from the user type.
2. Fix RGWSubUserPool::add(): generate a secret by default when creating
a subuser.
3. Fix RGWAccessKeyPool::generate_key(): avoid assigning the wrong username
to the key when creating a user and a subuser at the same time, and check
for an empty secret.
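A short Python sketch of those defaulting rules (the real code is the C++
radosgw-admin user machinery; names here are illustrative):
    KEY_TYPE_S3 = 's3'
    KEY_TYPE_SWIFT = 'swift'
    def default_key_type(context, requested=None, user_type=KEY_TYPE_S3):
        if requested is not None:
            return requested             # an explicit key type always wins
        if context == 'user':
            return KEY_TYPE_S3
        if context == 'subuser':
            return KEY_TYPE_SWIFT
        if context == 'key':
            return user_type             # follow the owning user's type
        raise ValueError('unknown context: %r' % context)
    print(default_key_type('subuser'))                     # -> swift
    print(default_key_type('key', user_type=KEY_TYPE_S3))  # -> s3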
After preparing an OSD, wait for the corresponding OSD to be up
according to ceph osd dump before asserting the devices are in the
expected state. Otherwise the test races with ceph-disk activate which
is run asynchronously via udev / upstart / systemd.
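One way to express that wait is a small helper that polls ceph osd dump
(the helper name and timeouts below are made up):
    import json
    import subprocess
    import time
    def wait_for_osd_up(osd_id, timeout=300, interval=5):
        # Poll `ceph osd dump` until the freshly prepared OSD reports up, so
        # assertions on device state do not race with ceph-disk activate.
        deadline = time.time() + timeout
        while time.time() < deadline:
            dump = json.loads(subprocess.check_output(
                ['ceph', 'osd', 'dump', '--format=json']).decode())
            for osd in dump.get('osds', []):
                if osd['osd'] == osd_id and osd['up']:
                    return
            time.sleep(interval)
        raise RuntimeError('osd.%d did not come up before the timeout' % osd_id)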
ceph-disk: ensure udev add on the data partition is last
When calling partprobe, we make sure there is at least one udev add event
for each partition created when preparing a device. But there is no
guarantee that the udev add for the data partition will be last, and the
following scenario can happen:
- the udev add for the data partition fails because the journal partition
is still owned by root
- the udev add for the journal partition chowns the journal partition
- no other udev add event is sent and the OSD does not activate
An additional, possibly redundant, udev add event is fired after partprobe
is run and after udevadm settles, to guarantee there is at least one udev
add for the data partition after the last udev add for the journal
partition.
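A hedged sketch of that extra event as a small Python helper around udevadm
(the helper name and timeout are illustrative):
    import subprocess
    def udev_add_data_partition(partition):
        # Wait for the events triggered by partprobe to finish, then fire one
        # more 'add' uevent for the data partition so it is guaranteed to come
        # after any journal partition event.
        subprocess.check_call(['udevadm', 'settle', '--timeout=600'])
        subprocess.check_call(['udevadm', 'trigger', '--action=add',
                               '--sysname-match=' + partition])
    # example: udev_add_data_partition('sdb1')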
ceph-disk: move update_partition from main_prepare to prepare_dev
The update_partition call in main_prepare happens immediately after
prepare_dev but only if the data argument is a block device. There is no
reason for this separation: it is more sensible to call it from within
prepare_dev.
An additional check in prepare_dev verifies that partprobe is not called
on a partition, where it would not make sense.
A side effect of partprobe is to remove the partitions and add them again.
The first udevadm settle waits for ongoing udev events to complete, just
in case one of them relies on an existing partition on dev.
The second udevadm settle guarantees to the caller that all udev events
related to the partition table change have been processed, i.e. the
95-ceph-osd.rules actions and mode changes, group changes etc. are
complete.
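Roughly, in Python (a sketch of what such an update_partition helper does;
the exact messages and arguments are illustrative):
    import subprocess
    def update_partition(dev, description):
        print('calling partprobe on %s device %s' % (description, dev))
        subprocess.check_call(['udevadm', 'settle'])  # let in-flight events finish
        subprocess.check_call(['partprobe', dev])     # re-read the partition table
        subprocess.check_call(['udevadm', 'settle'])  # wait for the resulting events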
Set the LOG level as well as the channel level; otherwise the debug
messages are trimmed before they reach the channel. Also set the prefix
while we're at it.
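In Python logging terms, that means something like the following sketch
(logger name and format are illustrative):
    import logging
    LOG = logging.getLogger('ceph-disk')
    LOG.setLevel(logging.DEBUG)            # logger (LOG) level: keep debug records
    channel = logging.StreamHandler()
    channel.setLevel(logging.DEBUG)        # channel (handler) level
    channel.setFormatter(
        logging.Formatter('%(name)s: %(levelname)s: %(message)s'))  # the prefix
    LOG.addHandler(channel)
    LOG.debug('this now reaches the channel')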
ceph-disk: prefer sgdisk to blkid to retrieve partition UUID
blkid 2.23.2, which is the default for official CentOS 7 cloud images,
fails on the journal device. It would be better to use blkid because it does
not trigger udev events, but it is more important to get reliable
results.
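A hedged sketch of reading the partition GUID via sgdisk from Python (the
helper name is made up; parsing follows sgdisk's "Partition unique GUID:"
output line):
    import subprocess
    def get_partition_uuid(dev, partnum):
        out = subprocess.check_output(
            ['sgdisk', '--info=%d' % partnum, dev]).decode()
        for line in out.splitlines():
            if line.startswith('Partition unique GUID:'):
                return line.split(':', 1)[1].strip().lower()
        return None
    # example: get_partition_uuid('/dev/sda', 1)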
ceph-disk: make ceph-disk list /dev/vdb equivalent to list vdb
The ceph-disk list argument must be the device name without the leading
/dev/; passing a full path is error prone because ceph-disk silently does
nothing. Strip the /dev/ prefix from ceph-disk list arguments so that it
behaves as expected.
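The normalization itself is small; a minimal sketch:
    def strip_dev_prefix(name):
        # Accept both 'vdb' and '/dev/vdb' for ceph-disk list.
        return name[len('/dev/'):] if name.startswith('/dev/') else name
    print(strip_dev_prefix('/dev/vdb'))   # -> vdb
    print(strip_dev_prefix('vdb'))        # -> vdb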
When running ceph-disk trigger /dev/dm-1 with systemd, the path name is
translated into /dev/dm/1 because of systemd escape rules. Explicitly
translate - into \x2d for systemd to preserve the -.
It would be better to use systemd-escape
http://www.freedesktop.org/software/systemd/man/systemd-escape.html
but it does not appear to be generally available on CentOS 7 and
probably other distributions.
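A minimal sketch of the workaround (only the dash needs escaping here):
    def escape_dash_for_systemd(dev):
        # Pre-escape '-' as \x2d so the systemd instance name keeps the dash
        # instead of having it unescaped into '/'.
        return dev.replace('-', '\\x2d')
    print(escape_dash_for_systemd('/dev/dm-1'))   # -> /dev/dm\x2d1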
ceph-disk: a journal partition may survive a data partition
When a data partition is removed and the journal partition is not
removed, ceph-disk list will not find the journal_for information and
should just ignore it.