client: check cap ID when handling cap export message
Handle the following sequence of events:
- mds0 exports an inode to mds1. client receives the cap import
message from mds1. caps from mds0 are removed while handling
the cap import message.
- mds1 exports the inode back to mds0. client receives the cap export
message from mds1. handle_cap_export() adds placeholder caps
for mds0.
- client receives the first cap export message (for exporting
inode from mds0 to mds1)
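A rough sketch of the guard this adds, with illustrative types and names rather than the actual Client code: before acting on a cap export message, compare the cap_id it carries against the cap we currently hold from that MDS, and ignore the message if they do not match (it names a cap that has already been removed).

    // Hypothetical sketch of the cap-id check; types and names are illustrative.
    #include <cstdint>
    #include <map>

    struct CapSketch { uint64_t cap_id; unsigned issued; };

    struct InodeSketch {
      std::map<int, CapSketch> caps;   // mds rank -> cap we currently hold
    };

    // Returns true if the export message should be processed, false if it is
    // stale (names a cap that has already been removed or replaced).
    bool should_handle_cap_export(const InodeSketch& in, int from_mds,
                                  uint64_t msg_cap_id) {
      auto it = in.caps.find(from_mds);
      if (it == in.caps.end())
        return false;                            // no cap from that MDS any more
      return it->second.cap_id == msg_cap_id;    // only act on the cap named
    }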
Sage Weil [Sun, 20 Apr 2014 05:04:33 +0000 (22:04 -0700)]
osd: change in up set primary constitutes a peering interval change
In several places, a change in the up_primary triggers a new peering
interval, but the places that actually generate the new past intervals,
including check_new_interval(), did not enforce that. This becomes
somewhat obvious when you see that those callers are ignoring the
up_primary output argument for pg_to_up_acting_osds().
Fix this by adding arguments to check_new_interval and fixing the callers
to pass them in properly. Add a unit test case to verify this.
Note that the past interval struct itself does not record who the
up_primary was; possibly it should.
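A simplified sketch of the comparison being enforced (the real OSDMap::check_new_interval() takes many more arguments, e.g. pool size and pg_num; names here are illustrative):

    // Simplified interval-change test; not the real signature.
    #include <vector>

    bool is_new_interval_sketch(int old_acting_primary, int new_acting_primary,
                                const std::vector<int>& old_acting,
                                const std::vector<int>& new_acting,
                                int old_up_primary, int new_up_primary,
                                const std::vector<int>& old_up,
                                const std::vector<int>& new_up) {
      return old_acting_primary != new_acting_primary ||
             old_acting != new_acting ||
             old_up_primary != new_up_primary ||   // the case this change enforces
             old_up != new_up;
    }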
Fixes: #8139
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Tue, 8 Apr 2014 21:03:59 +0000 (14:03 -0700)]
ReplicatedPG: do not create whiteout clones
First, make_writeable treats whiteout heads like snapdir for
cloning purposes. Second, to ensure that we send the correct
deletes on flush to the backing pool, we instead use oi.snaps
on any clone we are flushing to infer the snaps during which
head did not exist and send a delete as appropriate prior to
the copy_from.
Normally, we'd have a problem if the delete and the copy_from completed
but an interval change intervened before the dirty flag was cleared, since
we'd end up re-deleting the object.
To avoid that, we use the CEPH_OSD_FLAG_ORDERSNAP flag.
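Roughly, ORDERSNAP makes the OSD fail a write whose snap context seq is older than the newest one already applied to the object, so a stale re-delete after an interval change is rejected rather than repeated. A minimal illustrative sketch of that guard, not the actual OSD code:

    // Illustrative ORDERSNAP-style check; not the real ReplicatedPG logic.
    #include <cstdint>

    using snapid_t = uint64_t;

    constexpr int EOLDSNAPC_SKETCH = 85;   // placeholder errno-style value

    struct WriteOpSketch {
      bool ordersnap;       // the op carried CEPH_OSD_FLAG_ORDERSNAP
      snapid_t snapc_seq;   // snap context seq sent with the op
    };

    // obj_snap_seq: newest snap context seq already applied to the object.
    int check_ordersnap(const WriteOpSketch& op, snapid_t obj_snap_seq) {
      if (op.ordersnap && op.snapc_seq < obj_snap_seq)
        return -EOLDSNAPC_SKETCH;   // stale snapc: refuse instead of re-deleting
      return 0;
    }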
Additionally, we will use the correct snap_seq on the delete
or flush as appropriate to ensure that the previous clone
gets created with the same clone id as in the cache pool.
Fixes: #7942
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Thu, 17 Apr 2014 19:27:07 +0000 (12:27 -0700)]
osd/: propagate hit_set history with repop
We don't actually send the whole info on each repop, just the log
entries, updated stats, and a few other bits. For hit_set ops, we need
to also communicate the new hit_set history status atomically with the
log entries and the transaction. Thus, we add an optional
pg_hit_set_history_t field to the PGBackend::submit_transaction interface
and to the associated messages and implementations, so that the hit_set
info field is updated along with the log entries.
This also means that hit_set_(persist|trim) update an
updated_hit_set_history field on the OpContext instead of directly
modifying the info field.
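A sketch of the shape of that channel (heavily simplified; the real PGBackend::submit_transaction() interface and messages carry much more): the optional history rides alongside the log entries so replicas apply both in one transaction.

    // Simplified payload sketch; field and type names are illustrative.
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct pg_log_entry_sketch { uint64_t version; std::string op; };
    struct pg_hit_set_history_sketch { std::vector<uint64_t> archived_intervals; };

    struct TransactionPayloadSketch {
      std::vector<pg_log_entry_sketch> log_entries;
      // Set only by ops that persist or trim the hit_set (filled in via
      // updated_hit_set_history on the OpContext); applied atomically with
      // the log entries above.
      std::optional<pg_hit_set_history_sketch> updated_hit_set_history;
    };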
Fixes: #8124
Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 18 Apr 2014 18:12:23 +0000 (11:12 -0700)]
mon: wait for PaxosService readable in handle_get_version
We were waiting for the election to finish, but we need to *also* wait for
paxos to recover. Being a peon or leader is not sufficient and we may
return a map that is still old.
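A rough sketch of the intended guard, with illustrative names rather than the real Monitor code: answer only once the relevant PaxosService is readable, otherwise defer and retry.

    // Illustrative only; not the actual handle_get_version() implementation.
    struct PaxosServiceSketch {
      bool readable = false;
      bool is_readable() const { return readable; }
      void wait_for_readable() { /* queue the request for retry after recovery */ }
    };

    // Returns true if it is safe to reply with the current map version now.
    bool can_reply_get_version(PaxosServiceSketch& svc) {
      if (!svc.is_readable()) {
        svc.wait_for_readable();   // being leader or peon alone is not enough
        return false;
      }
      return true;
    }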
Fixes: #7997
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 17 Apr 2014 20:11:54 +0000 (13:11 -0700)]
osd/ReplicatedPG: check clones for degraded
We check whether the head is degraded, and we check whether a clone is
unreadable, but when a cache op touches a degraded clone we perform no
such check. That leads to an assert when the repop hits the replica and
the object is in the peer's missing set.
Fix this by adding a check on the clone when write_ordered is true. Note
that checking write_ordered is better than checking whether it is a cache
op, because we want to preserve write ordering even for reads that the
client has flagged.
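A sketch of the added condition, with illustrative helper names (not the real ReplicatedPG helpers): for a write-ordered op that touches a clone, a degraded clone must block the op just as a degraded head would.

    // Illustrative sketch; stubs stand in for the missing-set bookkeeping.
    #include <cstdint>
    #include <string>

    struct ObjectIdSketch { std::string oid; uint64_t snap; };

    bool is_degraded(const ObjectIdSketch&) { return false; }   // stub: check missing sets
    void wait_for_recovery(const ObjectIdSketch&) {}            // stub: requeue the op

    // Returns true if the op must wait because the clone it touches is degraded.
    bool must_wait_on_clone(bool write_ordered, const ObjectIdSketch& clone) {
      if (write_ordered && is_degraded(clone)) {
        wait_for_recovery(clone);
        return true;
      }
      return false;
    }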
Fixes: #8048
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 17 Apr 2014 17:48:26 +0000 (10:48 -0700)]
osdc/Objecter: fix osd target for newly-homeless op
If we recalculate the mapping and find that there is no primary, we need
to set the 'osd' field to -1. Otherwise, the caller will try to resend
to a dead session with bad results.
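In other words, after recalculating the mapping the target has to be marked homeless when there is no primary, roughly:

    // Illustrative sketch of the target update; not the real Objecter code.
    struct OpTargetSketch {
      int osd = -1;   // -1 means homeless: no session, wait for a newer osdmap
    };

    void update_target(OpTargetSketch& t, int acting_primary) {
      // acting_primary is -1 when the PG currently has no primary.
      t.osd = (acting_primary >= 0) ? acting_primary : -1;
    }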
Sage Weil [Thu, 17 Apr 2014 16:33:44 +0000 (09:33 -0700)]
mon: set leader commands prior to first election
If we have just started and receive a command, we currently will reply with
EINVAL because the leader commands are empty. Note that this race is very
difficult to reach because the (old) peon needs to forward a command to
the mon while it still thinks it has quorum, and the message needs to get
sent after the leader mon has restarted and reset its connection but before
it has declared a new election.
To fix this, we should assume at startup time that our commands are
valid. If it is an internal command that does not require quorum, that
is fine. If it does require quorum, we will retry the command after the
election completes and we will revalidate the command then.
Fixes: #8132
Signed-off-by: Sage Weil <sage@inktank.com>
John Spray [Thu, 17 Apr 2014 14:28:22 +0000 (15:28 +0100)]
mon: EBUSY instead of EAGAIN when pgs creating
In 69321bf, the handling of EAGAIN changed to block indefinitely rather
than returning to the user. Change the return for `osd pool set`
operations that are blocked by creating PGs to EBUSY instead of EAGAIN,
so that they are exempted from this blocking behaviour.
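The mechanical change is just the error code chosen for this case; a minimal illustrative sketch (not the real OSDMonitor code):

    // Return EBUSY rather than EAGAIN so the generic retry-on-EAGAIN path
    // introduced in 69321bf does not block this command indefinitely.
    #include <cerrno>

    int pool_set_return_code(bool pgs_still_creating) {
      if (pgs_still_creating)
        return -EBUSY;    // was -EAGAIN
      return 0;
    }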
Signed-off-by: John Spray <john.spray@inktank.com>
mds: dynamically adjust priority of committing dirfrags
Adjust the priority of committing dirfrags according to the number of
expiring log segments: the more expiring log segments, the higher the
priority, because a large backlog of expiring segments means the MDS is
not trimming its log quickly enough.
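A rough sketch of the idea, with made-up constants (the real MDS tuning differs): scale the commit priority with the number of log segments waiting to expire.

    // Illustrative priority calculation; constants are invented for the sketch.
    #include <algorithm>
    #include <cstdint>

    int dirfrag_commit_priority(uint64_t expiring_segments) {
      const int base_priority = 10;
      const int max_priority  = 50;
      // A growing backlog of expiring segments means log trimming is falling
      // behind, so commit dirfrags more urgently.
      return std::min<int>(max_priority,
                           base_priority + static_cast<int>(expiring_segments));
    }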
Sage Weil [Tue, 15 Apr 2014 20:57:21 +0000 (13:57 -0700)]
mon/OSDMonitor: require force argument to split a cache pool
There are several perils when splitting a cache pool:
- split invalidates pg stats, which disables the agent
- a scrub must be manually triggered post-split to rebuild stats
- the pool may fill the OSDs during that period
- or, the pool may end up beyond the 'full' mark, and once scrub does
complete and the agent activates we may block IO for a long time while
we catch up with flush/evict
Make it a bit harder for users to shoot themselves in the foot.
Fixes: #8043
Signed-off-by: Sage Weil <sage@inktank.com>
John Spray [Mon, 14 Apr 2014 16:14:42 +0000 (17:14 +0100)]
mds: Fix respawn (add path resolution)
Previously we assumed that the ceph-mds executable was in the PWD; now
use /proc/self/exe to find the executable wherever it may be. Leave the
old behaviour in place as a fallback for non-Linux environments.
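A minimal sketch of the Linux resolution step (error handling and the actual respawn elided): read the /proc/self/exe symlink to get the running binary's absolute path, falling back to the old value elsewhere.

    // Minimal sketch of resolving the running executable's path on Linux.
    #include <unistd.h>
    #include <climits>
    #include <string>

    std::string resolve_self_exe(const std::string& fallback) {
      char buf[PATH_MAX];
      ssize_t n = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
      if (n <= 0)
        return fallback;   // e.g. the original argv[0], for non-Linux systems
      buf[n] = '\0';       // readlink() does not NUL-terminate
      return std::string(buf);
    }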
Also add a 'respawn' command so that it's easy to test
respawn with `ceph mds tell <id> respawn`
mds: share max size with client who is allowed WR cap
The WR cap is allowed for the loner client when the filelock is in the
excl->mix state. The MDS should share the max size with the loner client
in this case; otherwise the client may wait for the max size forever.