Sage Weil [Mon, 16 Jun 2014 23:58:14 +0000 (16:58 -0700)]
mon: refactor check_health()
Refactor the get_health() methods to always take both a summary and detail.
Eliminate the return value and pull that directly from the summary, as we
already do with the PaxosServices.
Sage Weil [Mon, 16 Jun 2014 23:27:05 +0000 (16:27 -0700)]
mon/OSDMonitor: make down osd count sensible
We currently log something like
1/10 in osds are down
in the health warning when there are down OSDs, but this is based on a
comparison of the number of up vs the number of in osds, and makes no sense
when there are up osds that are not in.
Instead, count only the number of OSDs that are both down and in (relative to
the total number of OSDs in) and warn about that. This means that, if a
disk fails, and we mark it out, and the cluster fully repairs itself, it
will go back to a HEALTH_OK state.
I think that is a good thing, and certainly preferable to the current
nonsense. If we want to distinguish between down+out OSDs that were failed
vs those that have been "acknowledged" by an admin to be dead, we will
need to add some additional state (possibly reusing the AUTOOUT flag?), but
that will require more discussion.
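A minimal sketch of the counting rule described above. The OsdState and DownInCount types and the function names are hypothetical, made up for illustration, and are not the OSDMonitor code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-OSD state; the real OSDMap tracks this differently.
struct OsdState {
  bool up;  // daemon is running and reachable
  bool in;  // OSD is included in data placement
};

struct DownInCount {
  std::size_t down_in;   // OSDs that are both down and in
  std::size_t total_in;  // all OSDs that are in
};

// Count only OSDs that are down *and* in, relative to all OSDs that are in.
DownInCount count_down_in(const std::vector<OsdState>& osds) {
  DownInCount c{0, 0};
  for (const OsdState& o : osds) {
    if (!o.in) continue;  // out OSDs never contribute to the warning
    ++c.total_in;
    if (!o.up) ++c.down_in;
  }
  return c;
}

// A warning is warranted only when at least one in OSD is down.
bool should_warn(const DownInCount& c) { return c.down_in > 0; }
```

With this rule, an OSD that is down and has been marked out no longer counts, so a cluster that has fully repaired itself stops warning.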
Ilya Dryomov [Thu, 5 Jun 2014 06:08:42 +0000 (10:08 +0400)]
XfsFileStoreBackend: call ioctl(XFS_IOC_FSSETXATTR) less often
No need to call ioctl(XFS_IOC_FSSETXATTR) if extsize is already set to
the value we want or if any extents are allocated - XFS will refuse to
change extsize if that's the case.
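The decision can be sketched as a pure function. ExtsizeInfo and should_set_extsize are hypothetical names. The sketch assumes the current values were already queried (presumably via XFS_IOC_FSGETXATTR) and does not perform any ioctl itself:

```cpp
#include <cstdint>

// Hypothetical snapshot of the two values the check depends on.
struct ExtsizeInfo {
  std::uint32_t extsize;   // current extent size hint, in bytes
  std::uint32_t nextents;  // number of extents already allocated
};

// Returns true only when issuing the set call could actually change anything.
bool should_set_extsize(const ExtsizeInfo& cur, std::uint32_t wanted) {
  if (cur.extsize == wanted) return false;  // already the value we want
  if (cur.nextents > 0) return false;       // XFS would refuse the change
  return true;
}
```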
John Spray [Tue, 20 May 2014 15:25:19 +0000 (16:25 +0100)]
mon: Fix default replicated pool ruleset choice
Specifically, in the case where the configured
default ruleset is CEPH_DEFAULT_CRUSH_REPLICATED_RULESET,
instead of assuming ruleset 0 exists, choose the lowest
numbered ruleset.
In the case where an explicit ruleset is passed to
OSDMonitor::prepare_pool_crush_ruleset, verify
that it really exists.
The idea is to eliminate cases where a pool could
exist with its crush ruleset set to something
other than a valid ruleset ID.
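A rough sketch of the selection logic, assuming the set of existing ruleset IDs is available as a std::set. The sentinel value and the function name are hypothetical stand-ins, not the actual OSDMonitor interface:

```cpp
#include <set>

// Hypothetical stand-in for CEPH_DEFAULT_CRUSH_REPLICATED_RULESET.
constexpr int kDefaultRulesetSentinel = -1;

// Resolve the ruleset for a new replicated pool.
// Returns false when no valid ruleset can be chosen; *chosen is then untouched.
bool resolve_ruleset(int requested, const std::set<int>& existing, int* chosen) {
  if (requested == kDefaultRulesetSentinel) {
    if (existing.empty()) return false;
    *chosen = *existing.begin();  // std::set is ordered, so this is the lowest ID
    return true;
  }
  if (existing.count(requested) == 0) return false;  // explicit ID must exist
  *chosen = requested;
  return true;
}
```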
We were not breaking out of the loop when we filled up the buffer unless
we happened to do so on a pool name boundary. This means that len would
roll over (it was unsigned). In my case, I was not able to reproduce
anything particularly bad since (I think) the strncpy was interpreting the
large unsigned value as signed, but in any case this fixes it, simplifies
the arithmetic, and adds a simple test.
- use a single 'rl' value for the amount of buffer space we want to
consume
- use this to check that there is room and also as the strncat length
- rely on the initial memset to ensure that the trailing 0 is in place.
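The three points above can be sketched as follows. This is an illustration only: pack_names is a hypothetical helper, it uses memcpy where the commit describes strncat, and the returned "needed" size is simply the sum of the name lengths plus their terminators:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Pack names into buf as consecutive NUL-terminated strings, stopping as soon
// as one does not fit. Returns the space needed to hold all names.
std::size_t pack_names(const std::vector<std::string>& names, char* buf,
                       std::size_t len) {
  std::size_t needed = 0;
  for (const std::string& n : names) needed += n.size() + 1;

  if (buf == nullptr || len == 0) return needed;
  std::memset(buf, 0, len);  // trailing NULs are in place from the start

  char* p = buf;
  std::size_t remaining = len;
  for (const std::string& n : names) {
    const std::size_t rl = n.size() + 1;  // space this name consumes, incl. NUL
    if (rl > remaining) break;  // stop at once, so remaining can never wrap
    std::memcpy(p, n.c_str(), rl);
    p += rl;
    remaining -= rl;
  }
  return needed;
}
```

Using the same rl for both the room check and the copy length is what keeps the unsigned arithmetic from rolling over.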
Sage Weil [Fri, 6 Jun 2014 20:31:29 +0000 (13:31 -0700)]
osd/OSDMap: do not require ERASURE_CODE feature of clients
Just because an EC pool exists in the cluster does not mean that the client
has to support the feature:
1) The way client IO is initiated is no different for EC pools than for
replicated pools.
2) People may add an EC pool to an existing cluster with old clients and
locking those old clients out is very rude when they are not using the
new pool.
3) The only direct client user of EC pools right now is rgw, and the new
versions already need to support various other features like CRUSH_V2
in order to work. These features are present in new kernels.
Sage Weil [Thu, 12 Jun 2014 23:44:53 +0000 (16:44 -0700)]
osd/OSDMap: make get_features() take an entity type
Make the helper that returns what features are required of the OSDMap take
an entity type argument, as the required features may vary between
components in the cluster.
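A sketch of what an entity-aware helper could look like. The feature bits, EntityType enum, and MapSummary struct are hypothetical and simplified. Treating every non-client entity as requiring the erasure-code bit is an assumption of this sketch:

```cpp
#include <cstdint>

// Hypothetical feature bits, for illustration only.
constexpr std::uint64_t FEATURE_CRUSH_V2 = 1ULL << 0;
constexpr std::uint64_t FEATURE_ERASURE_CODE = 1ULL << 1;

enum class EntityType { Client, Osd, Mon };

// Hypothetical summary of what the map contains.
struct MapSummary {
  bool uses_crush_v2;
  bool has_ec_pool;
};

// Features a peer of the given type must support to use this map.
std::uint64_t required_features(const MapSummary& m, EntityType who) {
  std::uint64_t f = 0;
  if (m.uses_crush_v2) f |= FEATURE_CRUSH_V2;
  // Clients initiate IO to EC pools the same way as to replicated pools,
  // so the mere existence of an EC pool does not lock them out.
  if (m.has_ec_pool && who != EntityType::Client) f |= FEATURE_ERASURE_CODE;
  return f;
}
```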
Accela Zhao [Wed, 18 Jun 2014 09:17:03 +0000 (17:17 +0800)]
Make <poolname> in "ceph osd tier --help" clearer.
The ceph osd tier --help text on the left always says <poolname>.
It is unclear which <poolname> corresponds to <tierpool> on the right.
$ceph osd tier --help
osd tier add <poolname> <poolname> {-- add the tier <tierpool> to base pool
force-nonempty} <pool>
osd tier add-cache <poolname> add a cache <tierpool> of size <size>
<poolname> <int[0-]> to existing pool <pool>
...
This patch modifies the description on the right to say which <poolname> is which:
osd tier add <poolname> <poolname> {-- add the tier <tierpool> (the second
force-nonempty} one) to base pool <pool> (the first
one)
...
John Spray [Tue, 20 May 2014 15:50:18 +0000 (16:50 +0100)]
mon: pool set <pool> crush_ruleset must not use rule_exists
Implement CrushWrapper::ruleset_exists that iterates over the existing
rulesets to find the one matching the ruleset argument.
ceph osd pool set <pool> crush_ruleset must not use
CrushWrapper::rule_exists, which checks for a *rule* existing, whereas
the value being set is a *ruleset*.
(cherry picked from commit fb504baed98d57dca8ec141bcc3fd021f99d82b0)
A test via ceph osd pool set data crush_ruleset verifies the ruleset
argument is accepted.
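To illustrate the distinction, here is a simplified sketch. The Rule struct and the free functions are hypothetical and do not reflect the real CrushWrapper layout. The point is only that a rule ID and a ruleset ID are different keys:

```cpp
#include <vector>

// Hypothetical, simplified rule table: each rule has its own ID and
// belongs to a ruleset.
struct Rule {
  int rule_id;
  int ruleset;
};

// True if a rule with this rule ID exists.
bool rule_exists(const std::vector<Rule>& rules, int rule_id) {
  for (const Rule& r : rules)
    if (r.rule_id == rule_id) return true;
  return false;
}

// True if any rule belongs to this ruleset: iterate and match on ruleset.
bool ruleset_exists(const std::vector<Rule>& rules, int ruleset) {
  for (const Rule& r : rules)
    if (r.ruleset == ruleset) return true;
  return false;
}
```

With rules {0, 3} and {1, 3}, rule 1 exists but ruleset 1 does not, so validating a ruleset argument with rule_exists gives the wrong answer.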
Steve Taylor [Tue, 10 Jun 2014 18:42:55 +0000 (12:42 -0600)]
Fix for bug #6700
When preparing OSD disks with colocated journals, the initialization process
fails when using dmcrypt. The kernel fails to re-read the partition table after
the storage partition is created because the journal partition is already in use
by dmcrypt. This fix unmaps the journal partition from dmcrypt and allows the
partition table to be read.
Samuel Just [Fri, 16 May 2014 23:56:33 +0000 (16:56 -0700)]
ReplicatedPG::start_flush: fix clone deletion case
dsnapc.snaps will be non-empty most of the time if there
have been snaps before prev_snapc. What we really want to
know is whether there are any snaps between oi.snaps.back()
and prev_snapc.
Samuel Just [Mon, 12 May 2014 22:08:07 +0000 (15:08 -0700)]
ReplicatedPG::start_flush: send delete even if there are no snaps
Even if all snaps for the clone have been removed, we still have to
send the delete to ensure that when the object is recreated the
new snaps aren't included in the wrong clone.
Greg Farnum [Thu, 22 May 2014 04:41:23 +0000 (21:41 -0700)]
cephfs-java: build against older jni headers
Older versions of the JNI interface expected non-const parameters
to their memory move functions. It's unpleasant, but won't actually
change the memory in question, to do a const_cast in order to satisfy
those older headers. (And even if it *did* modify the memory, that
would be okay given our single user.)
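The pattern looks roughly like this. legacy_copy is a hypothetical stand-in for an older interface that takes a non-const source pointer even though it only reads from it. It is not the actual JNI function:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical older-style API: the source parameter is non-const,
// but the function never writes through it.
void legacy_copy(char* dst, char* src, std::size_t n) {
  std::memcpy(dst, src, n);
}

// Our code holds a const pointer; cast the const away only at the call site.
void copy_out(char* dst, const char* src, std::size_t n) {
  legacy_copy(dst, const_cast<char*>(src), n);
}
```

The cast is safe here only because the callee does not modify the source buffer.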
Ilya Dryomov [Fri, 16 May 2014 15:03:13 +0000 (19:03 +0400)]
OSDMonitor: set next commit in mon primary-affinity reply
Commit 8c5c55c8b47e ("mon: set next commit in mon command replies")
fixed MMonCommand replies to include the right version, but the
primary-affinity handler was authored before that. Fix it.
Dmitry Smirnov [Mon, 12 May 2014 04:08:44 +0000 (14:08 +1000)]
prioritise use of `javac` executable (gcj provides it through alternatives).
On Debian this fixes FTBFS when gcj-jdk and openjdk-7-jdk are installed at
the same time, because the build system will use the default `javac` executable
provided by the current JDK through `update-alternatives` instead of blindly
calling GCJ when it is present.
Sage Weil [Thu, 8 May 2014 15:52:51 +0000 (08:52 -0700)]
ceph-disk: partprobe before settle when preparing dev
Two users have reported this fixes a problem with using --dmcrypt.
Fixes: #6966
Tested-by: Eric Eastman <eric0e@aol.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f196265f049d432e399197a3af3f90d2e916275)
Greg Farnum [Tue, 13 May 2014 20:15:28 +0000 (13:15 -0700)]
test: fix some templates to match new output code
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 00225d739cefa1415524a3de45fb9a5a2db53018)
Greg Farnum [Thu, 15 May 2014 23:50:43 +0000 (16:50 -0700)]
OSD: fix an osdmap_subscribe interface misuse
When calling osdmap_subscribe, you have to pass an epoch newer than the
current map's. _maybe_boot() was not doing this correctly -- we would
fail a check for being *in* the monitor's existing map range, and then
pass along the map prior to the monitor's range. But if we were exactly
one behind, that value would be our current epoch, and the request would
get dropped. So instead, make sure we are not *in contact* with the monitor's
existing map range.
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 290ac818696414758978b78517b137c226110bb4)
Sage Weil [Mon, 19 May 2014 17:32:12 +0000 (10:32 -0700)]
osd: skip out of order op checks on tiered pools
When we send redirected ops, we do not assign a new tid, which means that
a given client's ops for a pool may not have strictly ordered tids. Skip
this check if the pool is tiered to avoid false positives.
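A minimal sketch of the check, assuming tids from one client are otherwise expected to increase strictly. The function name and parameters are hypothetical:

```cpp
#include <cstdint>

// Returns true when an op should be flagged as out of order.
bool is_out_of_order(std::uint64_t last_tid, std::uint64_t tid,
                     bool pool_is_tiered) {
  // Redirected ops keep their original tid, so ordering is not
  // guaranteed on tiered pools; skip the check to avoid false positives.
  if (pool_is_tiered) return false;
  return tid <= last_tid;
}
```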
Sage Weil [Thu, 8 May 2014 17:42:42 +0000 (10:42 -0700)]
mon/OSDMonitor: force op resend when pool overlay changes
If a client is sending a sequence of ops (say, a, b, c, d) and partway
through that sequence it receives an OSDMap update that changes the
overlay, the ops will get sent to different pools, and the replies will
come back completely out of order.
To fix this, force a resend of all outstanding ops any time the overlay
changes.
Sage Weil [Thu, 8 May 2014 17:52:11 +0000 (10:52 -0700)]
osdc/Objecter: resend ops in the last_force_op_resend epoch
If we are a client, and process a map that sets last_force_op_resend to
the current epoch, force a resend of this op.
If the OSD expects us to do this, it will discard our previous op. If the
OSD is old, it will process the old one, this will appear as a dup, and we
are no worse off than before.
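The client-side test can be sketched as below. PoolInfo and should_force_resend are hypothetical names, and the real Objecter logic involves more state than this:

```cpp
#include <cstdint>

// Hypothetical slice of per-pool state carried in the map.
struct PoolInfo {
  std::uint32_t last_force_op_resend;  // epoch at which a resend was forced
};

// When the map being processed is the epoch that set last_force_op_resend,
// outstanding ops for that pool are resent.
bool should_force_resend(std::uint32_t map_epoch, const PoolInfo& pool) {
  return pool.last_force_op_resend == map_epoch;
}
```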
Sage Weil [Fri, 9 May 2014 16:20:34 +0000 (09:20 -0700)]
osd: handle race between osdmap and prepare_to_stop
If we get a MOSDMarkMeDown message and set service.state == STOPPING, we
kick the prepare_to_stop() thread. Normally, it will wake up and then
set osd.state == STOPPING, and when we process the map message next we
will not warn. However, if dispatch() takes the lock instead and processes
the map, it will fail the preparing_to_stop check and issue a spurious
warning.
Fix by checking for either preparing_to_stop or stopping.
Fixes: #8319
Backport: firefly, emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6b858be0676f937a99dbd51321497f30c3a0097f)
Yehuda Sadeh [Tue, 6 May 2014 23:55:27 +0000 (16:55 -0700)]
rgw: fix stripe_size calculation
Fixes: #8299
Backport: firefly
The stripe size calculation was broken, specifically affecting cases
where we had a manifest that described multiple parts.
Yehuda Sadeh [Tue, 6 May 2014 18:06:29 +0000 (11:06 -0700)]
rgw: cut short object read if a chunk returns error
Fixes: #8289
Backport: firefly, dumpling
When reading an object, if we hit an error when trying to read one of
the rados objects then we should just stop. Otherwise we're just going
to continue reading the rest of the object, and since it can't be sent
back to the client (as we have a hole in the middle), we end up
accumulating everything in memory.
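A sketch of the early-exit behavior, under the assumption that chunks are read in order through a callback. ChunkReader and read_object are hypothetical and are not the rgw interfaces:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical chunk reader: appends chunk `i` to `out` and returns 0,
// or returns a negative error code without appending.
using ChunkReader = std::function<int(std::size_t i, std::string& out)>;

// Read chunks in order and stop at the first error.
// `attempted` reports how many chunk reads were issued.
int read_object(std::size_t num_chunks, const ChunkReader& read_chunk,
                std::string& out, std::size_t& attempted) {
  attempted = 0;
  for (std::size_t i = 0; i < num_chunks; ++i) {
    ++attempted;
    const int r = read_chunk(i, out);
    if (r < 0) return r;  // cut the read short; later chunks cannot be sent
  }
  return 0;
}
```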
Sage Weil [Fri, 18 Apr 2014 20:50:11 +0000 (13:50 -0700)]
osd: throttle snap trimming with simple delay
This is not particularly smart, but it is *a* knob that lets you make
the snap trimmer slow down. It's a float and a simple delay, so it is
adjustable at runtime. Default is 0 (no change in behavior).
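A minimal sketch of such a knob, assuming the delay is a number of seconds read at each trim step. The function and parameter names are hypothetical:

```cpp
#include <chrono>
#include <thread>

// Sleep between trim operations when a positive delay is configured.
// Returns true if a delay was applied; 0 keeps the previous behavior.
bool snap_trim_throttle(double trim_sleep_seconds) {
  if (trim_sleep_seconds <= 0.0) return false;
  std::this_thread::sleep_for(
      std::chrono::duration<double>(trim_sleep_seconds));
  return true;
}
```

Because the value is read on every call, changing it at runtime takes effect on the next trim step.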
Calling _finish_hunting() clears out the bool hunting flag, which means we
don't retry by connecting to another mon periodically. Instead, we send
keepalives every 10s. But, since we aren't yet in state HAVE_SESSION, we
don't check that the keepalives are getting responses. This means that an
ill-timed connection reset (say, after we get a MonMap, but before we
finish authenticating) can drop the monc into a black hole that does not
retry.
Instead, we should *only* call _finish_hunting() when we complete the
authentication handshake.
Fixes: #8278
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 77a6f0aefebebf057f02bfb95c088a30ed93c53f)
Sage Weil [Tue, 6 May 2014 18:01:27 +0000 (11:01 -0700)]
osd/ReplicatedPG: fix whiteouts for other cache mode
We were special casing WRITEBACK mode for handling whiteouts; this needs to
also include the FORWARD and READONLY modes. To avoid having to list
specific cache modes, though, just check != NONE.
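In sketch form, with a hypothetical CacheMode enum mirroring the modes named above:

```cpp
// Hypothetical enum; the real cache mode type may have more values.
enum class CacheMode { None, Writeback, Forward, ReadOnly };

// Instead of listing each mode that needs whiteout handling,
// treat every mode other than None the same way.
bool uses_whiteouts(CacheMode mode) { return mode != CacheMode::None; }
```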
Sage Weil [Thu, 1 May 2014 23:53:17 +0000 (16:53 -0700)]
osd: Prevent divide by zero in agent_choose_mode()
Fixes: #8175
Backport: firefly
Signed-off-by: David Zafman <david.zafman@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f47f867952e6b2a16a296c82bb9b585b21cde6c8)
David Zafman [Tue, 22 Apr 2014 06:52:04 +0000 (23:52 -0700)]
osd, common: If agent_work() finds no objs to work on delay 5 (default) secs
Add config osd_agent_delay_time of 5 seconds
Honor delay by ignoring agent_choose_mode() calls
Add tier_delay to logger
Treat restart after delay like we were previously idle
Yan, Zheng [Sat, 3 May 2014 21:17:15 +0000 (05:17 +0800)]
osd: check blacklisted clients in ReplicatedPG::do_op()
The OSD checks whether a client is blacklisted only when receiving an OSD request.
It's possible that the request's sender gets blacklisted while the
request is in some waiting list.
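A sketch of re-checking at processing time, assuming the blacklist is a set of client addresses. Op and do_op here are simplified, hypothetical stand-ins for the real types:

```cpp
#include <set>
#include <string>

// Hypothetical, simplified op carrying only the sender's address.
struct Op {
  std::string client_addr;
};

// Check the blacklist again when the op is actually processed, since the
// sender may have been blacklisted while the op sat in a waiting list.
// Returns true if the op is processed, false if it is dropped.
bool do_op(const Op& op, const std::set<std::string>& blacklist) {
  if (blacklist.count(op.client_addr) > 0) return false;
  return true;
}
```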
Samuel Just [Fri, 18 Apr 2014 00:26:17 +0000 (17:26 -0700)]
ReplicatedPG: block scrub on blocked object contexts
Fixes: #8011
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e66f2e36c06ca00c1147f922d3513f56b122a5c0)
Samuel Just [Thu, 24 Apr 2014 19:48:44 +0000 (12:48 -0700)]
ECBackend::continue_recovery_op: handle a source shard going down
get_min_avail_to_read_shards might return an error if there are
no longer enough sources to reconstruct the missing shards.
This is possible if osds went down while we were writing the
previous chunk -- we already notice in check_recovery_sources
if a source goes down during a read.
Samuel Just [Fri, 11 Apr 2014 01:15:30 +0000 (18:15 -0700)]
ReplicatedPG: do not preserve op context during flush
Any information stashed in the OpContext may be obsolete by the time we
actually mark the object clean. Instead, let the start_flush caller
clean up its OpContext and in try_flush_mark_clean we'll create a new
one. The primary reason to keep the OpContext would have been locking,
but we can set the obc as blocking without holding an OpContext, and
that would allow trimming to happen in the mean time (which is good
since trim_object does not respect rw locks since it doesn't change user
visible state). In try_flush_mark_clean, we requeue the fop->op along
with (but ahead of) the fop->dup_ops.