Sage Weil [Wed, 2 Jul 2014 17:38:43 +0000 (10:38 -0700)]
qa/workunits/rest/test.py: make osd create test idempotent
Avoid possibility that we create multiple OSDs do to retries by passing in
the optional uuid arg. (A stray osd id will make the osd tell tests a
few lines down fail.)
Loic Dachary [Wed, 18 Jun 2014 22:49:13 +0000 (00:49 +0200)]
erasure-code: create default profile if necessary
After an upgrade to firefly, the existing Ceph clusters do not have the
default erasure code profile. Although it may be created with
ceph osd erasure-code-profile set default
it was not included in the release notes and is confusing for the
administrator.
The *osd pool create* and *osd crush rule create-erasure* commands are
modified to implicitly create the default erasure code profile if it is
not found.
In order to avoid code duplication, the default erasure code profile
code creation that happens when a new firefly ceph cluster is created is
encapsulated in the OSDMap::get_erasure_code_profile_default method.
Conversely, handling the pending change in OSDMonitor is not
encapsulated in a function but duplicated instead. If it was a function
the caller would need a switch to distinguish between the case when goto
wait is needed, or goto reply or proceed because nothing needs to be
done. It is unclear if having a function would lead to smaller or more
maintainable code.
Sage Weil [Fri, 9 May 2014 15:41:33 +0000 (08:41 -0700)]
osd: cancel agent_timer events on shutdown
We need to cancel all agent timer events on shutdown. This also needs to
happen early so that any in-progress events will execute before we start
flushing and cleaning up PGs.
Sage Weil [Tue, 8 Jul 2014 23:11:44 +0000 (16:11 -0700)]
osd: s/applying repop/canceling repop/
The 'applying' language dates back to when we would wait for acks from
replicas before applying writes locally. We don't do any of that any more;
now, this loop just cancels the repops with remove_repop() and some other
cleanup.
Sage Weil [Tue, 8 Jul 2014 23:10:58 +0000 (16:10 -0700)]
osd: separate cleanup from PGBackend::on_change()
The generic portion of on_change() cleaned up temporary on-disk objects
and requires a Transaction. The rest is clearing out in-memory state and
does not. Separate the two.
mon: AuthMonitor: always encode full regardless of keyserver having keys
On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers. A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.
As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals. This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk. This is performed in update_from_paxos(), and the
AuthMonitor's will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.
Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first. If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals and keeps track
of auth global ids. As such, we expect to read version 1, then version 2,
and so on. If we trim at some point however this will not be possible,
as version 1 will not exist -- and we will assert because of that.
This is fixed by ensuring the AuthMonitor keeps track of full versions
of the key server, even if it's of an empty key server -- it will still
keep track of the key server's version, which is incremented each time
we update from paxos even if it is empty.
rgw: account common prefixes for MaxKeys in bucket listing
To be more in line with the S3 api. Beforehand we didn't account the
common prefixes towards the MaxKeys (a single common prefix counts as a
single key). Also need to adjust the marker now if it is pointing at a
common prefix.
If found a prefix, calculate a string greater than that so that next
request we can skip to that. This is still not the most efficient way to
do it. It'll be better to push it down to the objclass, but that'll
require a much bigger change.
Sage Weil [Thu, 17 Jul 2014 23:40:06 +0000 (16:40 -0700)]
logrotate.conf: fix osd log rotation under upstart
In commit 7411c3c6a42bef5987bdd76b1812b01686303502 we generalized this
enumeration code by copying what was in the upstart scripts. However,
while the mon and mds directories get a 'done' file, the OSDs get a 'ready'
file. Bah! Trigger off of either one.
rgw: don't try to wait for pending if list is empty
Fixes: #8846
Backport: firefly, dumpling
This was broken at ea68b9372319fd0bab40856db26528d36359102e. We ended
up calling wait_pending_front() when pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).
Sage Weil [Wed, 16 Jul 2014 01:11:41 +0000 (18:11 -0700)]
init-ceph: wrap daemon startup with systemd-run when running under systemd
We want to make sure the daemon runs in its own systemd environment. Check
for systemd as pid 1 and, when present, use systemd-run -r <cmd> to do
this.
Probably fixes #7627
Signed-off-by: Sage Weil <sage@redhat.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Tested-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 3e0d9800767018625f0e7d797c812aa44c426dab)
A while ago we bumped the head version and reset the compat version to 0.
Doing this so happens to make the messenger assume that the message does
not support the compat versioning and sets the compat version to the head
version -- thus making compat = 2 when it should have been 1.
The nasty side-effect of this is that upgrading from emperor to firefly
will have emperor-leaders being unable to decode forwarded messages from
firefly-peons.
Dan Mick [Thu, 3 Jul 2014 23:08:44 +0000 (16:08 -0700)]
Fix/add missing dependencies:
- rbd-fuse depends on librados2/librbd1
- ceph-devel depends on specific releases of libs and libcephfs_jni1
- librbd1 depends on librados2
- python-ceph does not depend on libcephfs1
mon: check changes to the whole CRUSH map and to tunables against cluster features
When we change the tunables, or set a new CRUSH map, we need to make sure it's
supported by all the monitors and OSDs currently participating in the cluster.
Disable this test until hitget-get reliably works on EC pools (currently
it does not, and this test usually passes only because we get the in-memory
HitSet).
Yehuda Sadeh [Wed, 11 Jun 2014 23:50:41 +0000 (16:50 -0700)]
rgw: set a default data extra pool name
Fixes: #8585
Have a default name for the data extra pool, otherwise it would be empty
which means that it'd default to the data pool name (which is a problem
with ec backends).
Yehuda Sadeh [Tue, 6 May 2014 22:35:20 +0000 (15:35 -0700)]
rgw: extend manifest to avoid old style manifest
In case we hit issue #8269 we'd like to avoid creating an old style
manifest. Since we need to have parts that use different prefix we add a
new rule param that overrides the manifest prefix.
Yehuda Sadeh [Sat, 3 May 2014 00:06:05 +0000 (17:06 -0700)]
rgw: don't allow multiple writers to same multiobject part
Fixes: #8269
Backport: firefly, dumpling
A client might need to retry a multipart part write. The original thread
might race with the new one, trying to clean up after it, clobbering the
part's data.
The fix is to detect whether an original part already existed, and if so
use a different part name for it.
Sage Weil [Mon, 16 Jun 2014 23:58:14 +0000 (16:58 -0700)]
mon: refactor check_health()
Refactor the get_health() methods to always take both a summary and detail.
Eliminate the return value and pull that directly from the summary, as we
already do with the PaxosServices.
Sage Weil [Mon, 16 Jun 2014 23:27:05 +0000 (16:27 -0700)]
mon/OSDMonitor: make down osd count sensible
We currently log something like
1/10 in osds are down
in the health warning when there are down OSDs, but this is based on a
comparison of the number of up vs the number of in osds, and makes no sense
when there are up osds that are not in.
Instead, count only the number OSDs that are both down and in (relative to
the total number of OSDs in) and warn about that. This means that, if a
disk fails, and we mark it out, and the cluster fully repairs itself, it
will go back to a HEALTH_OK state.
I think that is a good thing, and certainly preferable to the current
nonsense. If we want to distinguish between down+out OSDs that were failed
vs those that have been "acknowledged" by an admin to be dead, we will
need to add some additional state (possibly reusing the AUTOOUT flag?), but
that will require more discussion.
Yehuda Sadeh [Wed, 4 Jun 2014 16:24:33 +0000 (09:24 -0700)]
rgw: set meta object in extra flag when initializing it
As part of the fix for 8452 we moved the meta object initialization.
Missed moving the extra flag initialization that is needed. This breaks
setups where there's a separate extra pool (needed in ec backends).
Yehuda Sadeh [Mon, 16 Jun 2014 18:48:24 +0000 (11:48 -0700)]
rgw: allocate enough space for bucket instance id
Fixes: #8608
Backport: dumpling, firefly
Bucket instance id is a concatenation of zone name, rados instance id,
and a running counter. We need to allocate enough space to account zone
name length.
Ilya Dryomov [Thu, 5 Jun 2014 06:08:42 +0000 (10:08 +0400)]
XfsFileStoreBackend: call ioctl(XFS_IOC_FSSETXATTR) less often
No need to call ioctl(XFS_IOC_FSSETXATTR) if extsize is already set to
the value we want or if any extents are allocated - XFS will refuse to
change extsize in that's the case.
John Spray [Tue, 20 May 2014 15:25:19 +0000 (16:25 +0100)]
mon: Fix default replicated pool ruleset choice
Specifically, in the case where the configured
default ruleset is CEPH_DEFAULT_CRUSH_REPLICATED_RULESET,
instead of assuming ruleset 0 exists, choose the lowest
numbered ruleset.
In the case where an explicit ruleset is passed to
OSDMonitor::prepare_pool_crush_ruleset, verify
that it really exists.
The idea is to eliminate cases where a pool could
exist with its crush ruleset set to something
other than a value ruleset ID.
Fixes: #8169
Backport: firefly
We didn't calculate the user manifest's object etag at all. The etag
needs to be the md5 of the contantenation of all the parts' etags.
We were not breaking out of the loop when we filled up the buffer unless
we happened to do so on a pool name boundary. This means that len would
roll over (it was unsigned). In my case, I was not able to reproduce
anything particularly bad since (I think) the strncpy was interpreting the
large unsigned value as signed, but in any case this fixes it, simplifies
the arithmetic, and adds a simple test.
- use a single 'rl' value for the amount of buffer space we want to
consume
- use this to check that there is room and also as the strncat length
- rely on the initial memset to ensure that the trailing 0 is in place.
Sage Weil [Fri, 6 Jun 2014 20:31:29 +0000 (13:31 -0700)]
osd/OSDMap: do not require ERASURE_CODE feature of clients
Just because an EC pool exists in the cluster does not mean tha tthe client
has to support the feature:
1) The way client IO is initiated is no different for EC pools than for
replicated pools.
2) People may add an EC pool to an existing cluster with old clients and
locking those old clients out is very rude when they are not using the
new pool.
3) The only direct client user of EC pools right now is rgw, and the new
versions already need to support various other features like CRUSH_V2
in order to work. These features are present in new kernels.
Sage Weil [Thu, 12 Jun 2014 23:44:53 +0000 (16:44 -0700)]
osd/OSDMap: make get_features() take an entity type
Make the helper that returns what features are required of the OSDMap take
an entity type argument, as the required features may vary between
components in the cluster.
Accela Zhao [Wed, 18 Jun 2014 09:17:03 +0000 (17:17 +0800)]
Make <poolname> in "ceph osd tier --help" clearer.
The ceph osd tier --help info on the left always says <poolname>.
It is unclear which one to put <tierpool> on the right.
$ceph osd tier --help
osd tier add <poolname> <poolname> {-- add the tier <tierpool> to base pool
force-nonempty} <pool>
osd tier add-cache <poolname> add a cache <tierpool> of size <size>
<poolname> <int[0-]> to existing pool <pool>
...
This patch modifies description on the right to tell which <poolname>:
osd tier add <poolname> <poolname> {-- add the tier <tierpool> (the second
force-nonempty} one) to base pool <pool> (the first
one)
...
John Spray [Tue, 20 May 2014 15:50:18 +0000 (16:50 +0100)]
mon: pool set <pool> crush_ruleset must not use rule_exists
Implement CrushWrapper::ruleset_exists that iterates over the existing
rulesets to find the one matching the ruleset argument.
ceph osd pool set <pool> crush_ruleset must not use
CrushWrapper::rule_exists, which checks for a *rule* existing, whereas
the value being set is a *ruleset*. (cherry picked from commit fb504baed98d57dca8ec141bcc3fd021f99d82b0)
A test via ceph osd pool set data crush_ruleset verifies the ruleset
argument is accepted.
Steve Taylor [Tue, 10 Jun 2014 18:42:55 +0000 (12:42 -0600)]
Fix for bug #6700
When preparing OSD disks with colocated journals, the intialization process
fails when using dmcrypt. The kernel fails to re-read the partition table after
the storage partition is created because the journal partition is already in use
by dmcrypt. This fix unmaps the journal partition from dmcrypt and allows the
partition table to be read.
Samuel Just [Fri, 16 May 2014 23:56:33 +0000 (16:56 -0700)]
ReplicatedPG::start_flush: fix clone deletion case
dsnapc.snaps will be non-empty most of the time if there
have been snaps before prev_snapc. What we really want to
know is whether there are any snaps between oi.snaps.back()
and prev_snapc.