In the placement group code, track prior_set_lost. This fixes a bug
where a new OSDMap updates an OSD's lost_at time, but the PG code does
not update the PG data structures.
When clearing the peering state, call clear_prior() rather than manually
clearing every prior set.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Greg Farnum [Mon, 18 Oct 2010 18:23:50 +0000 (11:23 -0700)]
messenger: Make sure to unlock existing->pipe_lock.
There are a few cases in the "open" section where we can go to
fail_unlocked while still holding existing->pipe_lock. So unlock it.
Sage Weil [Thu, 21 Oct 2010 23:15:03 +0000 (16:15 -0700)]
client: fix dcache removal during multiple frags
We remove unexpected dentries from our cache while processing mds results.
Results are ordered within a frag, but not between them. Since we can
have multiple frags, only remove results for the current frag, to avoid
removing items from earlier frags.
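
A minimal sketch of the per-frag pruning described above, assuming a flat cache keyed by dentry name. `FragId`, `DentryCache`, and `prune_frag` are hypothetical names for illustration, not the client's actual types:

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins: a frag is identified by an int, and the cache maps
// dentry name -> owning frag. These are not the Ceph client's real types.
using FragId = int;
using DentryCache = std::map<std::string, FragId>;

// Drop cached dentries that belong to `frag` but are missing from the MDS
// result for that frag. Dentries owned by other frags are left untouched.
// Returns the number of dentries removed.
inline int prune_frag(DentryCache& cache, FragId frag,
                      const std::set<std::string>& seen) {
  int removed = 0;
  for (auto it = cache.begin(); it != cache.end();) {
    if (it->second == frag && seen.count(it->first) == 0) {
      it = cache.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

Restricting the sweep to the current frag is what keeps entries from earlier frags alive.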
Sage Weil [Thu, 21 Oct 2010 18:37:45 +0000 (11:37 -0700)]
objecter: reconnect on osd disconnect
If the connection to an OSD closes, we need to reconnect and resubmit our
ops. Otherwise we just hang. This is problematic for transient errors,
since we only retry when the OSDMap reflects a change, and that
won't happen for transient network/socket errors and such.
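
The resubmit step could look roughly like the sketch below. `PendingOp` and `ops_to_resend` are hypothetical names, and the real Objecter tracks far more state per op:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical bookkeeping for one in-flight op; not Objecter's real types.
struct PendingOp {
  int osd;       // OSD the op was last sent to
  int attempts;  // number of times the op has been sent
};

// Called when the connection to `osd` drops. Every in-flight op aimed at that
// OSD is counted as resent and its tid is returned (ascending order); ops for
// other OSDs are untouched.
inline std::vector<std::uint64_t> ops_to_resend(
    std::map<std::uint64_t, PendingOp>& ops, int osd) {
  std::vector<std::uint64_t> resend;
  for (auto& p : ops) {
    if (p.second.osd == osd) {
      ++p.second.attempts;
      resend.push_back(p.first);
    }
  }
  return resend;
}
```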
Objecter::shutdown() needs to call Timer::join() to ensure that
concurrently executing events in other threads get flushed before the
Objecter and its Timer are destroyed.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Greg Farnum [Tue, 19 Oct 2010 15:57:22 +0000 (08:57 -0700)]
Revert "Revert "messenger: introduce a "halt_delivery" flag, checked by queue_delivery.""
This reverts commit d44267c2d6a77d4a3cda1e44ec7c58a19be51cc4.
The problem with this code was that it's possible for the Pipe
to be reused after calling discard_queue(), and we didn't
account for that. So, with this revert, it now sets halt_delivery=false
at the end of discard_queue() and the Pipe is ready for continued use.
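
A minimal sketch of the flag handling, assuming a simple locked queue. `DeliveryQueue` is a hypothetical stand-in for the messenger's Pipe and only borrows the names `halt_delivery`, `queue_delivery`, and `discard_queue` from the message above:

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

class DeliveryQueue {
 public:
  // Returns false (dropping the message) while delivery is halted.
  bool queue_delivery(const std::string& msg) {
    std::lock_guard<std::mutex> l(lock_);
    if (halt_delivery_) return false;
    q_.push_back(msg);
    return true;
  }

  // Throw away everything queued. Delivery is halted while the cleanup runs
  // without the lock, then re-enabled so the object can be reused.
  void discard_queue() {
    std::deque<std::string> doomed;
    {
      std::lock_guard<std::mutex> l(lock_);
      halt_delivery_ = true;
      doomed.swap(q_);
    }
    doomed.clear();  // potentially slow cleanup, done unlocked
    {
      std::lock_guard<std::mutex> l(lock_);
      halt_delivery_ = false;  // ready for continued use
    }
  }

  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return q_.size();
  }

 private:
  std::mutex lock_;
  bool halt_delivery_ = false;
  std::deque<std::string> q_;
};
```

Without the final reset, every later `queue_delivery()` call on a reused object would be silently dropped.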
Sage Weil [Mon, 18 Oct 2010 20:28:42 +0000 (13:28 -0700)]
filestore: deliberate crash on ENOSPC or EIO
Neither of these is handled, so crash when we hit them. This ensures we
don't blindly continue on with a partially applied transaction and corrupt
our store any further.
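
As an illustration only, the check might be sketched like this. `is_fatal_store_error` and `check_store_result` are hypothetical helpers, and the sketch assumes results are returned as negative errno values:

```cpp
#include <cerrno>
#include <cstdlib>

// Hypothetical classifier: which negative-errno results are treated as fatal.
inline bool is_fatal_store_error(int r) {
  return r == -ENOSPC || r == -EIO;
}

// Abort the process instead of continuing with a half-applied transaction.
inline void check_store_result(int r) {
  if (is_fatal_store_error(r)) std::abort();
}
```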
Signed-off-by: Sage Weil <sage@newdream.net>
Conflicts:
Sage Weil [Mon, 18 Oct 2010 20:28:07 +0000 (13:28 -0700)]
filestore: deliberate crash on ENOSPC or EIO
Neither of these is handled, so crash when we hit them. This ensures we
don't blindly continue on with a partially applied transaction and corrupt
our store any further.
Greg Farnum [Mon, 18 Oct 2010 18:23:50 +0000 (11:23 -0700)]
messenger: Make sure to unlock existing->pipe_lock.
There are a few cases in the "open" section where we can go to
fail_unlocked while still holding existing->pipe_lock. So unlock it.
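
The shape of the fix, sketched with a plain `std::mutex`. `FakePipe` and `open_sketch` are hypothetical; only the `pipe_lock` and `fail_unlocked` names come from the message above:

```cpp
#include <mutex>

// Hypothetical stand-in for the existing Pipe.
struct FakePipe {
  std::mutex pipe_lock;
};

// Returns 0 on success, -1 on failure. On either path, existing.pipe_lock
// must not be held when the function returns.
inline int open_sketch(FakePipe& existing, bool fail_early) {
  existing.pipe_lock.lock();
  if (fail_early) {
    existing.pipe_lock.unlock();  // the fix: release before jumping
    goto fail_unlocked;
  }
  existing.pipe_lock.unlock();
  return 0;

fail_unlocked:
  // This label assumes no locks are held.
  return -1;
}
```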
Sage Weil [Fri, 15 Oct 2010 16:37:40 +0000 (09:37 -0700)]
mds: use correct helper when pinning past snaprealm parent
The helper also updates the SnapRealm::open_past_parents, which is needed
for the have_past_parents_open() check.
That is used when, among other things, we import caps; not updating it
prevented the cap import from sending the client cap message, which makes
the mds<->client cap relationship get out of sync.
Sage Weil [Fri, 15 Oct 2010 14:59:29 +0000 (07:59 -0700)]
mds: take nestlock wrlock when projecting rstat into dirfrag
We were already checking that we _can_ wrlock before doing the rstat
projection (if we can't, we mark_dirty_rstat() on the inode), but we
weren't actually taking the wrlock to prevent lock state changes while
that happened.
This bug eventually manifested itself as a failed assertion at the
now familiar
mds/CInode.cc: In function 'virtual void CInode::decode_lock_state(int, ceph::bufferlist&)':
mds/CInode.cc:1364: FAILED assert(pf->rstat == rstat)
Greg Farnum [Fri, 15 Oct 2010 18:21:02 +0000 (11:21 -0700)]
messenger: introduce timeouts on pipes.
This will return read errors on a pipe if it gets no data
for the given period of time (default 15 minutes). In a stateful
session the Connection will hang around and the next write will
initiate standard reconnect, so things keep working but we don't
rack up hundreds of useless threads!
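
The idle check behind such a timeout could be sketched as follows. `read_timed_out` is a hypothetical helper; the 15 minute default is taken from the message above, everything else is an assumption:

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// True once no data has arrived for at least `timeout`.
inline bool read_timed_out(Clock::time_point last_data, Clock::time_point now,
                           std::chrono::seconds timeout = std::chrono::minutes(15)) {
  return now - last_data >= timeout;
}
```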
Sage Weil [Thu, 14 Oct 2010 22:07:16 +0000 (15:07 -0700)]
mds: fix can_scatter_pin() to be only SYNC and MIX
Those are the only states where the replica can effectively prevent the
lock from cycling in a way that would force a frozen dirfrag beneath
the scatterpinned inode to update/journal something
(accounted_fragstat/rstat).
If the user has turned on journalling, but left osd_journal_size at 0,
normally we would use the existing size of the journal without
modifications. If the journal doesn't exist (i.e., we are running
mkjournal()), we have to check for this condition and return an error.
We can't create a journal if we don't know what size that journal needs
to be.
This fixes a bug where an extremely small journal file was being
created, leading to an infinite loop in FileJournal::wrap_read_bl().
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
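
A rough sketch of the size decision described in this commit, assuming sizes in bytes and a negative errno on failure. `pick_journal_size` is a hypothetical helper, not the FileJournal code:

```cpp
#include <cerrno>
#include <cstdint>

// Returns the journal size to use, or -EINVAL if it cannot be determined.
inline std::int64_t pick_journal_size(std::int64_t configured,
                                      bool journal_exists,
                                      std::int64_t existing_size) {
  if (configured > 0) return configured;      // explicit size wins
  if (journal_exists) return existing_size;   // reuse the existing journal as-is
  return -EINVAL;  // creating a journal requires a known size
}
```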
Greg Farnum [Mon, 11 Oct 2010 16:40:29 +0000 (09:40 -0700)]
mds: Fix projection in rename code paths.
We aren't actually projecting the inode unless destdn->is_auth(),
so check for that before projecting the snaprealm (which requires
a projected inode)!
Then on rename_apply, open the snaprealm on non-auth MDSes.
This was causing a mis-match in the projection code, since
assimilate_...finish() calls pop_and_dirty_projected_inode(), but
the first half is only called on CEPH_LOCK_INEST locks. So make them match!
Greg Farnum [Thu, 7 Oct 2010 20:04:26 +0000 (13:04 -0700)]
mds: MDCache should adjust_nested_anchors once the op's been logged.
Fixes crashes from assert(nested_anchors >= 0) failures
when updating at the wrong point.
Sage Weil [Tue, 12 Oct 2010 04:25:17 +0000 (21:25 -0700)]
mds: avoid EXCL if mds_caps_wanted in _do_cap_update
The file_excl() trigger asserts mds_caps_wanted is empty, so the caller
shouldn't call it when mds_caps_wanted is non-empty. In that case, just go to LOCK instead.
All we're doing is picking a state to move to that will allow us to
update max_size.
Sage Weil [Tue, 12 Oct 2010 03:51:19 +0000 (20:51 -0700)]
mds: bump rstat version in predirty_journal_parents
When we propagate the rstat to inode in predirty_journal_parents (because
we hold the nestlock), bump the rstat version as well. This avoids
confusing any replicas, who expect the rstat to have a new version or to
remain unchanged when the lock scatters.
Sage Weil [Fri, 8 Oct 2010 17:45:51 +0000 (10:45 -0700)]
filestore: don't start commit if nothing new is _applied_
We were starting a commit if we had started a new op, but that left a
window in which the op could still be journaling while nothing new had been
applied to disk. With this fix we only commit if committing/committed
will increase. Now the check matches the
Sage Weil [Thu, 7 Oct 2010 23:15:45 +0000 (16:15 -0700)]
osd: loosen caller_ops asserts
The problem is that merge_log adds new items to the log before it unindexes
divergent items, and that behavior is needed by the current implementation
of merge_old_entry(). Since the divergent items may be the same requests
(and frequently are), these asserts need to be loosened up.
Now, the most recent addition "wins," and we only remove the entry in
unindex() if it points to us.
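
The "most recent addition wins" rule can be sketched as below. `Entry` and `CallerOps` are hypothetical simplifications of the PG log index, with request ids reduced to strings:

```cpp
#include <map>
#include <string>

// Hypothetical log entry.
struct Entry {
  std::string reqid;
  int version;
};

class CallerOps {
 public:
  // The most recent addition wins: a later entry with the same reqid
  // overwrites the earlier mapping. The caller must keep `e` alive while
  // it is indexed, since only its address is stored.
  void index(const Entry& e) { ops_[e.reqid] = &e; }

  // Only remove the mapping if it still points at this exact entry.
  void unindex(const Entry& e) {
    auto it = ops_.find(e.reqid);
    if (it != ops_.end() && it->second == &e) ops_.erase(it);
  }

  const Entry* lookup(const std::string& reqid) const {
    auto it = ops_.find(reqid);
    return it == ops_.end() ? nullptr : it->second;
  }

 private:
  std::map<std::string, const Entry*> ops_;
};
```

With this rule, unindexing a divergent entry after the merged one has been added leaves the merged entry's mapping in place.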
Sage Weil [Thu, 7 Oct 2010 23:09:25 +0000 (16:09 -0700)]
osd: move to boot state if down OR wrong address in map
Saw an OSD that was up in the map, but the address didn't match. Caused
all kinds of strange behavior. I'm not sure what I had in mind when the
original test only checked for down AND same address before moving to boot
state, since having the wrong address is clearly bad news.
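
The changed condition, as a sketch. `should_move_to_boot` is a hypothetical helper and addresses are simplified to strings:

```cpp
#include <string>

// Move to boot state if the map says we are down OR lists a different address.
// (The original test required down AND same address.)
inline bool should_move_to_boot(bool up_in_map, const std::string& map_addr,
                                const std::string& my_addr) {
  return !up_in_map || map_addr != my_addr;
}
```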