Sage Weil [Thu, 17 Apr 2014 20:11:54 +0000 (13:11 -0700)]
osd/ReplicatedPG: check clones for degraded
We check whether the head is degraded, and we check whether a clone is
unreadable, but in the case where we have a cache op on a degraded object,
we don't check. That leads to an assert when the repop hits the replica
and the object is in the peer's missing set.
Fix this by adding a check on the clone when write_ordered is true. Note
that checking write_ordered is better than whether it is a cache op because
we want to preserve write ordering even for reads that are flagged by the
client.
Fixes: #8048
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 17 Apr 2014 17:48:26 +0000 (10:48 -0700)]
osdc/Objecter: fix osd target for newly-homeless op
If we recalculate the mapping and find that there is no primary, we need
to set the 'osd' field to -1. Otherwise, the caller will try to resend
to a dead session with bad results.
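A minimal sketch of the idea, not the Objecter code itself. `TargetSketch`, `PrimaryMap`, `recalc_target_sketch()` and `can_send()` are hypothetical names, and the pg-to-primary lookup stands in for the real mapping calculation. The point is only that the recalculation always overwrites the `osd` field, including the reset to -1 when no primary is found.

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified stand-in for an op's target.
struct TargetSketch {
  int pgid = 0;
  int osd = -1;  // -1 means "homeless": there is no session to send to
};

// Hypothetical lookup table: pg id -> primary osd. A pg with no primary
// simply has no entry.
using PrimaryMap = std::map<int, int>;

// Recompute the target from the current map. Returns true if it changed.
bool recalc_target_sketch(TargetSketch& t, const PrimaryMap& primaries) {
  PrimaryMap::const_iterator it = primaries.find(t.pgid);
  const int new_osd = (it == primaries.end()) ? -1 : it->second;
  const bool changed = (new_osd != t.osd);
  t.osd = new_osd;  // always overwrite, so a stale osd id never survives
  return changed;
}

// A caller would check this before trying to resend.
bool can_send(const TargetSketch& t) { return t.osd >= 0; }
```

If the field kept its old value when the lookup failed, `can_send()` would still report true and the caller would resend to a session that no longer exists.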
John Spray [Thu, 17 Apr 2014 14:28:22 +0000 (15:28 +0100)]
mon: EBUSY instead of EAGAIN when pgs creating
In 69321bf, EAGAIN changed behaviour to block indefinitely
rather than returning to user. Change the return for
`osd pool set` operations that are blocked by creating PGs
to return EBUSY instead of EAGAIN, so that they are excepted
from this blocking behaviour.
Signed-off-by: John Spray <john.spray@inktank.com>
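A rough sketch of why the error code matters to the caller. `run_command_sketch()` is a hypothetical client-side helper, not a Ceph API, and the `max_attempts` bound is an assumption added so the sketch terminates (the commit describes the EAGAIN path as blocking indefinitely).

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical helper: keep retrying while the command reports -EAGAIN and
// hand any other result, including -EBUSY, straight back to the caller.
int run_command_sketch(const std::function<int()>& send, int max_attempts) {
  assert(max_attempts > 0);
  int r = -EAGAIN;
  for (int i = 0; i < max_attempts && r == -EAGAIN; ++i) {
    r = send();
  }
  return r;
}
```

With this shape, a command that answers -EBUSY returns after a single attempt, while one that answers -EAGAIN is retried.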
Sage Weil [Tue, 15 Apr 2014 20:57:21 +0000 (13:57 -0700)]
mon/OSDMonitor: require force argument to split a cache pool
There are several perils when splitting a cache pool:
- split invalidates pg stats, which disables the agent
- a scrub must be manually triggered post-split to rebuild stats
- the pool may fill the OSDs during that period.
- or, the pool may end up beyond the 'full' mark and once scrub does
complete and the agent activates we may block IO for a long time while
we catch up with flush/evict
Make it a bit harder for users to shoot themselves in the foot.
Fixes: #8043
Signed-off-by: Sage Weil <sage@inktank.com>
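The guard described above could look roughly like this. `check_split_sketch()` and its parameters are hypothetical, and the choice of -EPERM as the refusal code is an assumption rather than something stated in the commit message.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical validation step for a pg_num change on a pool.
// Returns 0 if the change may proceed, a negative errno otherwise.
int check_split_sketch(bool is_cache_pool, int cur_pg_num, int new_pg_num,
                       bool force) {
  assert(cur_pg_num > 0 && new_pg_num > 0);
  const bool is_split = new_pg_num > cur_pg_num;
  if (is_cache_pool && is_split && !force) {
    return -EPERM;  // assumed error code; the user must pass the force argument
  }
  return 0;
}
```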
John Spray [Mon, 14 Apr 2014 16:14:42 +0000 (17:14 +0100)]
mds: Fix respawn (add path resolution)
Previously assumed that ceph-mds executable was in
PWD - now use /proc/self/exe to find the
executable wherever it may be. Leave in old version
as a fallback for non-linux environments.
Also add a 'respawn' command so that it's easy to test
respawn with `ceph mds tell <id> respawn`
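A minimal sketch of the path lookup on Linux, assuming nothing about how the MDS code is actually structured. `self_exe_path_sketch()` is a hypothetical name; `readlink(2)` and `/proc/self/exe` are the standard Linux facilities.

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <unistd.h>

// Resolve the path of the running executable via /proc/self/exe.
// Returns an empty string if the link cannot be read, so a caller can fall
// back to another strategy (for example on non-Linux systems).
std::string self_exe_path_sketch() {
  char buf[PATH_MAX];
  const ssize_t n = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
  if (n < 0) {
    return std::string();
  }
  // readlink() does not NUL-terminate, so build the string from the length.
  return std::string(buf, static_cast<size_t>(n));
}
```

The result could then be passed as the program path to an `exec`-family call. Note that `readlink()` silently truncates if the buffer is too small.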
Dan Mick [Wed, 9 Apr 2014 04:06:55 +0000 (21:06 -0700)]
Use cpp_strerror() wherever possible, and use autoconf for portability
strerror_r is not portable; on GNU libc it returns char * and sometimes
does not fill in the supplied buffer. Use autoconf to test which
version this platform uses and adapt.
Clean up the random calls to strerror and strerror_r (along with all
their private little one-use buffers) and regularize the code to use
cpp_strerror almost everywhere. Where changed, any negation of the
error code is also removed, since cpp_strerror() will do that.
Note: some tools were using their own calls to strerror/strerror_r, so
will now get a (%d) in their output that wasn't there before; hence
the change to test/cli/monmaptool/print-nonexistent.t
Fixes: #8041
Signed-off-by: Dan Mick <dan.mick@inktank.com>
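A sketch of a wrapper that copes with both `strerror_r` variants. The commit uses an autoconf test to pick the variant; this sketch swaps in a different technique, overloading on the return type, so that it stays self-contained. `describe_errno_sketch()` and `pick()` are hypothetical names, and the exact "(N) message" layout is an assumption based on the "(%d)" mentioned above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <sstream>
#include <string>

namespace {
// XSI strerror_r returns int and fills the buffer.
const char* pick(int rc, const char* buf) {
  return rc == 0 ? buf : "Unknown error";
}
// GNU strerror_r returns the message pointer and may leave the buffer unused.
const char* pick(const char* msg, const char*) { return msg; }
}  // namespace

// Format an error code as "(N) message". Accepts either sign, so callers do
// not need to negate the code first.
std::string describe_errno_sketch(int err) {
  if (err < 0) {
    err = -err;
  }
  char buf[128];
  buf[0] = '\0';
  // Overload resolution selects the handler matching this platform's variant.
  const char* msg = pick(strerror_r(err, buf, sizeof(buf)), buf);
  std::ostringstream oss;
  oss << "(" << err << ") " << msg;
  return oss.str();
}
```

Because the wrapper normalizes the sign itself, `describe_errno_sketch(-ENOENT)` and `describe_errno_sketch(ENOENT)` produce the same string.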
Sage Weil [Mon, 14 Apr 2014 04:31:35 +0000 (21:31 -0700)]
osd/ReplicatedPG: handle dup ops earlier in do_op
Currently the dup op checks happen in execute_ctx, long after we handle
cache ops or get the obc and (potentially) return ENOENT. That means that
object deletions and cache ops both aren't properly idempotent.
This is easy to fix by moving the check earlier in do_op.
Fixes: #8089
Signed-off-by: Sage Weil <sage@inktank.com>
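A toy illustration of why the ordering matters for idempotency. `StoreSketch` and `handle_delete_sketch()` are hypothetical and much simpler than do_op; the only point is that the duplicate check runs before the existence check.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <set>
#include <string>

// Hypothetical object store with a log of already-completed requests.
struct StoreSketch {
  std::set<std::string> objects;
  std::map<int, int> completed;  // request id -> result already returned
};

// Delete handler that checks for a duplicate request first.
int handle_delete_sketch(StoreSketch& s, int reqid, const std::string& name) {
  std::map<int, int>::const_iterator dup = s.completed.find(reqid);
  if (dup != s.completed.end()) {
    return dup->second;  // resent request: replay the earlier result
  }
  // Only a genuinely new request reaches the existence check.
  const int r = s.objects.erase(name) ? 0 : -ENOENT;
  s.completed[reqid] = r;
  return r;
}
```

If the existence check ran first, a resent delete would find the object already gone and report -ENOENT instead of the original success.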
mds: don't issue/revoke caps before client has caps
If early reply is not allowed, the MDS does not send a reply to the client
immediately after Locker::issue_new_caps adds new caps. So the MDS can revoke
the caps before sending the reply to the client.
MDCache::do_file_recover may call Locker::eval_gather, which may change
filelock to stable state. So we should authpin the inode (for unstable
lock state) first.
Sage Weil [Sun, 13 Apr 2014 05:23:26 +0000 (22:23 -0700)]
osd/ReplicatedPG: handle misdirected do_command
We can get a query on a pg we still have but are no longer primary for. If
that happens, do not reply. The client will resend to the correct OSD
assuming it has the map. Send them the latest incremental so that we know
they know there is something new. We don't know the exact epoch they have,
unfortunately, because MCommand doesn't include it, but a newer inc is
enough to make them request the right incrementals from a mon. Eventually
they will figure it out and Objecter will resend the request to the
correct target.
It is possible we should include epoch in the MCommand message so that we
can do this mapping "correctly" (as in, the same way MOSDOp does). That
makes MCommand less general, though... a PG-specific command message might
be the most precise thing. Another day...
Fixes: #8085
Signed-off-by: Sage Weil <sage@inktank.com>
David Zafman [Fri, 11 Apr 2014 00:16:33 +0000 (17:16 -0700)]
librados: Add ObjectWriteOperation::snap_rollback() for pool snapshots
snap_rollback() is the same as selfmanaged_snap_rollback() but we want an
independent interface for pool snapshots. Should really take snapname
for consistency with other pool snapshot interfaces.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Fri, 11 Apr 2014 22:39:23 +0000 (15:39 -0700)]
osd/PG: fix repair_object when missing on primary
If the object is missing on the primary, we need to fully populate the
missing_loc.needs_recovery_map. This broke with the recent refactoring of
recovery for EC, somewhere around 84e2f39c557c79e9ca7c3c3f0eb0bfa4860bf899.
Fixes: #8008
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 11 Apr 2014 21:48:26 +0000 (14:48 -0700)]
ceph_test_librados_tier: tolerate EAGAIN from pg scrub command
We may get EAGAIN if the osd happens to be down, for example due to
thrashing. Try a few times and then give up.
Note that the other place we try to scrub we don't even check the return
value as we are poking every pg in the pool. And the scrub commands get
lost due to any peering event, etc.
Sage Weil [Fri, 11 Apr 2014 21:32:21 +0000 (14:32 -0700)]
mon/OSDMonitor: fix osd epoch in boot check
This was introduced in 4c99e978a77a242e540cb8ccacb967d24322416c and was
incorrect; boot_epoch is the previous epoch the osd booted in, not the
latest map epoch that the OSD currently has.
Sage Weil [Fri, 11 Apr 2014 20:14:58 +0000 (13:14 -0700)]
osd/ReplicatedPG: skip missing hit_sets when loading into memory
We weren't handling hit_sets that were missing.
Two changes here:
1- Load the hit_sets oldest to newest. That means that if we stop partway
through loading, and then add another to the end of the list, and then
try again to load some more, we will still catch them all.
2- If the object is missing, stop. We'll try again the next time
agent_work() is called.
Fixes: #8077
Signed-off-by: Sage Weil <sage@inktank.com>
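The two changes can be sketched as follows. `load_hit_sets_sketch()` is a hypothetical helper; hit_sets are reduced to integer ids, `archive` is assumed to be ordered oldest first, and `available` stands in for "the object is not missing".

```cpp
#include <cassert>
#include <set>
#include <vector>

// Load archived hit_sets into memory, oldest to newest.
// Returns the number of hit_sets newly loaded by this call.
int load_hit_sets_sketch(const std::vector<int>& archive,
                         const std::set<int>& available,
                         std::set<int>& loaded) {
  int newly_loaded = 0;
  for (int id : archive) {
    if (loaded.count(id)) {
      continue;  // already in memory from an earlier call
    }
    if (!available.count(id)) {
      break;  // object is missing: stop here and retry on the next call
    }
    loaded.insert(id);
    ++newly_loaded;
  }
  return newly_loaded;
}
```

Because the walk always restarts from the oldest entry, a call that stopped partway and a later call made after new entries were appended together still cover the whole list.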
It's important to assign these for all operations for cases where
g_lockdep isn't yet true when the constructor runs. This is true
for the HeartbeatMap rwlock, among other things, as that thread
is created during early startup before lockdep is enabled. All
of the lockdep hooks assume that they can assign ids on the fly
and not tracking them here breaks things.
test:
Add set_completion*PP() functions to cast arg to correct class
Add return_value checks
Add some reads with buffers larger than object size
Check buffer length on reads
librados:
Make sure *return_value() has bytes read in all cases
Signed-off-by: David Zafman <david.zafman@inktank.com>
David Zafman [Wed, 2 Apr 2014 18:54:51 +0000 (11:54 -0700)]
librados, test: Have write, append and write_full return 0 on success
Make write, append and write_full consistent: all return 0 on success
Include C (rados_*) variants, C++ ctx variants
and aio get_return_value() and rados_aio_get_return_value()
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Thu, 10 Apr 2014 20:34:58 +0000 (13:34 -0700)]
mon/OSDMonitor: ignore boot message from before last up_from
It is possible we will have a dup OSDBoot message queued up in the mon
and will process it again after that osd was marked up and then down. If
that happens, we should ignore this message, not mark the osd back in with
the same address.
Fixes: #8062
Signed-off-by: Sage Weil <sage@inktank.com>
Previously we assumed that if we had snapset_obc set, !exists on the head
and if we got the snapdir lock we were good to take the head lock too.
This is not the case when:
- delete queued
- takes wr lock on both head and snapdir
- delete commits (but not yet applied)
- stat
- tries to take wr lock on head
- blocks, toggles w=1 state on *head only*
- copy-from
- tries to take wr lock on snapdir, succeeds
- tries to take wr lock on head, fails because w=1
- fails the assert(got)
The problem is that the read and write paths are taking different locks
and we are expecting them to operate in synchrony.
Fix this by using the same ordering for reads as well as writes: if the
snapset_obc is defined, take the read lock on that too, just as we do with
a write.
Fixes: #8046
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
mon: Monitor: suicide on start if mon has been removed from monmap
If the monitor has been marked as having been part of an existing quorum
and is no longer in the monmap, then it is safe to assume the monitor
was removed from the monmap. In that event, do not allow the monitor
to start, as it will try to find its way into the quorum again (and
someone clearly stated they don't really want them there), unless
'mon force quorum join' is specified.
Fixes: #6789
Backport: dumpling, emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
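The start-up decision reduces to a small predicate. `may_start_sketch()` and its parameter names are hypothetical; the third flag corresponds to the 'mon force quorum join' option named above.

```cpp
#include <cassert>

// Decide whether a monitor may start.
// joined_quorum_before: the monitor was marked as part of an existing quorum
// in_monmap:            the monitor still appears in the current monmap
// force_quorum_join:    the operator explicitly asked to rejoin anyway
bool may_start_sketch(bool joined_quorum_before, bool in_monmap,
                      bool force_quorum_join) {
  if (joined_quorum_before && !in_monmap) {
    // Treated as removed from the monmap: refuse unless forced.
    return force_quorum_join;
  }
  return true;
}
```

A monitor that has never joined a quorum is not affected, since its absence from the monmap says nothing about removal.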
mds: guarantee message ordering when importing non-auth caps
The current code allows importing non-auth caps when the inode is being exported.
This can break message ordering because the corresponding cap import
messages are sent after the flush session messages. So they can arrive
at clients after clients have already received cap import messages from
new auth MDS of the inode.
The quick fix is to ignore MExportCaps when the inode is frozen.
mds: remove wrong assertion for remote frozen authpin
For a cross-authority rename, the MDS first freezes the source inode's
authpin. It happens while the source dentry isn't locked. So when the
inode's authpin becomes frozen, the source dentry may have changed and
be linked to a different inode.
Sage Weil [Thu, 10 Apr 2014 01:02:27 +0000 (18:02 -0700)]
osdc/Objecter: move mapping into struct, helper
Move the common bits of Op and LingerOp into op_target_t and separate the
actual mapping calculation into calc_target(). This hugely simplifies
recalc_*op_target() by mostly just shuffling all of the same logic into
that helper.
There is one functional change in this patch: recalc_linger_op() now is
aware of the tiering logic that was previously only handled in
recalc_op_target().
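A simplified sketch of the shape of this refactoring, with hypothetical names throughout (`PoolTargetSketch`, `OpSketch`, `LingerOpSketch`, `TierMap`, `calc_target_sketch()`). The real mapping calculation is far more involved; here the "tiering logic" is reduced to a pool redirect so the effect of sharing one helper is visible.

```cpp
#include <cassert>
#include <map>

// Shared target state, embedded in both kinds of op.
struct PoolTargetSketch {
  int base_pool = 0;    // pool the caller asked for
  int target_pool = 0;  // pool actually used after any tier redirect
};

struct OpSketch { PoolTargetSketch target; };
struct LingerOpSketch { PoolTargetSketch target; };

// Hypothetical tier table: base pool -> overlay (cache) pool.
using TierMap = std::map<int, int>;

// One helper holds the mapping logic, so every caller gets the redirect.
// Returns true if the target changed.
bool calc_target_sketch(PoolTargetSketch& t, const TierMap& tiers) {
  TierMap::const_iterator it = tiers.find(t.base_pool);
  const int pool = (it == tiers.end()) ? t.base_pool : it->second;
  const bool changed = (pool != t.target_pool);
  t.target_pool = pool;
  return changed;
}
```

Since both op types pass their embedded target to the same helper, a linger op picks up the redirect without any code of its own.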
Sage Weil [Wed, 9 Apr 2014 23:03:05 +0000 (16:03 -0700)]
mon: tell peers missing features during probe
Use a new probe op to inform mons that they are missing features during
the earliest probe phase. This prevents them from getting as far as
the sync entirely if they are too old.
We still need to refuse to speak to them if they try to call an election,
which they could do based on their replies from other peers.
Note that old clients will assert on getting a message type string they
don't understand, so we need to be careful not to send the probe reply
to older clients. The feature bit we use is not precise in that it does
not cover recent dev releases, but it does work for dumpling and emperor.
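A sketch of the decision made when a probe arrives, under stated assumptions. `ProbeAction`, `on_probe_sketch()` and the bit-mask arguments are hypothetical, and what happens to a peer that cannot be told (here: `Ignore`) is an assumption, since the message above only says the reply must not be sent to it.

```cpp
#include <cassert>
#include <cstdint>

enum class ProbeAction {
  Proceed,               // peer has every required feature
  ReplyMissingFeatures,  // tell the peer what it lacks
  Ignore                 // peer would not understand the new reply (assumed)
};

// required:        feature bits this monitor requires of its peers
// peer:            feature bits the probing peer advertises
// understands_bit: feature bit used to guess whether the peer can parse the
//                  new probe reply
ProbeAction on_probe_sketch(uint64_t required, uint64_t peer,
                            uint64_t understands_bit) {
  const uint64_t missing = required & ~peer;
  if (missing == 0) {
    return ProbeAction::Proceed;
  }
  if (peer & understands_bit) {
    return ProbeAction::ReplyMissingFeatures;
  }
  return ProbeAction::Ignore;
}
```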