Sage Weil [Tue, 8 Feb 2011 16:20:19 +0000 (08:20 -0800)]
osd: always write backlog after creation
dirty_log is never set to true, so we would set the log.backlog flag but
not write it to disk. If we restarted the OSD, we would think we had the
backlog in the log but in reality we would not. clean_up_local() could
then erase almost every object in the PG.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Fri, 4 Feb 2011 23:46:27 +0000 (15:46 -0800)]
rados: Adds CEPH_OSD_OP_SCRUB_MAP sub op
Previously, maps were requested with a sub_op and sent with a
sub_op_reply. As maps will now be requested using a different message,
replicas will transmit scrub maps requested via MOSDRepScrub messages by
sending a sub_op of type CEPH_OSD_OP_SCRUB_MAP.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Mon, 7 Feb 2011 04:49:59 +0000 (20:49 -0800)]
mon: ignore mds boot messages with zeroed port
On 0.24.2 I saw a zeroed port in the cmds log and in the mdsmap. Ignore
anything from a cmds with a zeroed port to prevent the insanity from
spreading.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Wed, 2 Feb 2011 21:35:21 +0000 (13:35 -0800)]
mount.ceph: option parsing fix
Passing -o secretfile could cause a segfault, since searching for '=' could
return a null pointer; the new version checks for that case. Also, *end
must not be a ','.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
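The null-pointer hazard can be sketched as follows; this is a minimal stand-alone illustration with a hypothetical helper name, not the actual mount.ceph code:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper illustrating the fix: find the value part of an
// "option=value" token. The old code dereferenced the result of strchr()
// unconditionally, so a bare "secretfile" with no '=' segfaulted.
static const char *option_value(const char *opt) {
  const char *eq = strchr(opt, '=');  // may be NULL if no '=' is present
  if (!eq)
    return nullptr;                   // tolerate a missing value
  return eq + 1;                      // value starts after the '='
}
```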
Samuel Just [Tue, 1 Feb 2011 21:48:39 +0000 (13:48 -0800)]
FileStore: fix double close
curr_fd is already closed if cp == cur_seq. This second close
occasionally ended up closing another thread's fd. The next open would
tend to grab that fd in op_fd or current_fd which would then get closed
by the other thread leaving op_fd or current_fd pointing to some random
file (or a closed descriptor).
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
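The fd-recycling race is a classic pattern; a minimal sketch (hypothetical names, not the actual FileStore code) of the close-exactly-once fix:

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical sketch of the double-close hazard: once a descriptor is
// closed, a concurrent open() in another thread can be handed the same
// number, so a second close() silently closes that thread's file.
// Closing exactly once and invalidating the cached fd avoids this.
struct FdCache {
  int curr_fd = -1;
  void close_current() {
    if (curr_fd >= 0) {
      ::close(curr_fd);
      curr_fd = -1;  // a later close_current() is now a no-op
    }
  }
};
```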
Sage Weil [Fri, 28 Jan 2011 20:35:38 +0000 (12:35 -0800)]
mds: defer sending resolves until mdsmap.failed.empty()
There is no point sending resolves while there are still failed nodes,
since we can't complete. We also trigger an assert if we try to send to
a failed node. Instead just wait until failed.empty() and then start.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 28 Jan 2011 09:24:49 +0000 (01:24 -0800)]
osd: fix mutual exclusion for _dispatch
We want only one thread dispatching messages (either new or requeued), so
that we can preserve ordering. Previously we weren't doing so for all
callers of do_waiters (tick() and the first in ms_dispatch()).
This fixes osd_sub_op(_reply) ordering problems that trigger the
now-famous repop queue assert.
Sage Weil [Wed, 26 Jan 2011 18:06:49 +0000 (10:06 -0800)]
osd: preserve ordering when ops are requeued
Requeue ops under osd_lock to preserve ordering wrt incoming messages.
Also drain the waiter queue when ms_dispatch takes the lock before calling
_dispatch(m).
Fixes: #743 Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Tue, 25 Jan 2011 23:28:49 +0000 (15:28 -0800)]
osd: restart if the osdmap client, heartbeat, OR cluster addrs don't match
If we somehow get ourselves into a situation where the OSDMap addresses do
not match our actual addresses, restart and try again. This is still
possible if multiple MOSDBoot messages end up in flight in the monitor,
say due to a monitor disconnect/reconnect, and we race with something that
marks us down in the map.
Sage Weil [Tue, 25 Jan 2011 23:04:06 +0000 (15:04 -0800)]
osd: avoid extraneous send_boot() calls
Only send_boot() on osdmap update if we are restarting. Otherwise we can
end up with too many MOSDBoot messages in flight and the monitor may
apply an old one instead of a new one. For example:
- cosd starts
- send_boot with address set A
- get an osdmap update
- send_boot again with address set A
- get an osdmap update. now we're up.
- get osdmap update, now we're marked down,
- bind to address set B
- send_boot with address set B
and the monitor may apply the second MOSDBoot (with address set A).
This results in an online OSD using a cluster address that differs from
that in the OSDMap. Which causes problems with peering, among other
things.
Samuel Just [Tue, 25 Jan 2011 21:58:36 +0000 (13:58 -0800)]
ReplicatedPG: _rollback_to fix the just cloned condition
_rollback_to does not need to do anything in the case where head was just
cloned and that clone includes snapid. Previously, snapid had to match the
snap on the clone exactly, but the correct condition is that snapid is
contained within the clone's snaps set.
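The corrected condition can be sketched like this, with hypothetical types standing in for Ceph's snapid and snaps collection:

```cpp
#include <cassert>
#include <set>

using snapid_t = unsigned long long;  // stand-in for Ceph's snapid type

// Hypothetical sketch: the old check required snapid to equal the clone's
// snap; the corrected check asks whether snapid is contained in the clone's
// snaps set, since a single clone can cover several snapshots.
static bool clone_covers(const std::set<snapid_t> &clone_snaps, snapid_t snapid) {
  return clone_snaps.count(snapid) > 0;
}
```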
Samuel Just [Fri, 21 Jan 2011 21:01:20 +0000 (13:01 -0800)]
ReplicatedPG: fix snap_trimmer log version bug
Previously, ctx->at_version would be the same as ctx->obs->oi.version
leading to the log entry having prior_version == version.
This bug was introduced in d1b85e06fb5ce1cfd5bbc74ba639811b92033909.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Greg Farnum [Fri, 21 Jan 2011 22:20:16 +0000 (14:20 -0800)]
FileJournal: don't overflow the journal size.
Previously we were casting it to a uint64_t, but the left shift
occurs before the cast, so we were overflowing in some circumstances.
Split these up to prevent it.
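The shift-before-cast bug is easy to reproduce in isolation; a hypothetical sketch (not the actual FileJournal code) of both forms:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the bug: with a 32-bit block count, the shift in
// "num_blocks << block_bits" is performed in 32-bit arithmetic, and the
// cast to uint64_t happens too late to help.
static uint64_t journal_bytes_buggy(uint32_t num_blocks, unsigned block_bits) {
  return (uint64_t)(num_blocks << block_bits);  // overflows before the cast
}

// The fix: widen first, then shift, as the commit's "split these up" implies.
static uint64_t journal_bytes_fixed(uint32_t num_blocks, unsigned block_bits) {
  return (uint64_t)num_blocks << block_bits;    // shift happens in 64 bits
}
```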
Sage Weil [Fri, 21 Jan 2011 18:08:26 +0000 (10:08 -0800)]
msgr: always start reaper
If we didn't explicitly bind (i.e. are a client), then we don't start
the accepter. That's fine. But the reaper thread start was also
conditional, when it shouldn't be; otherwise the client can't clean up
old Pipes (and their sockets).
Fixes: #732 Signed-off-by: Sage Weil <sage@newdream.net>
Previously, snap_trimmer would get the clone object information from the
object store rather than using find_object_context. This would cause
the cached version to not be updated with the new version in the case
that the object information got updated. As a result, the need field of
the missing object could get a stale version inconsistent with the most
recent logged version.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Wed, 19 Jan 2011 22:02:27 +0000 (14:02 -0800)]
ReplicatedPG.cc: update coi version and prior_version to match log
This fixes an error where the oi on a clone would not get an updated
version when snaps was updated; oi.version would lag behind the missing
item's need field during recovery.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Tue, 18 Jan 2011 23:09:51 +0000 (15:09 -0800)]
mds: kick discovers when peers enter active|clientreplay|rejoin
We process discovers when active, clientreplay, or later stages of rejoin.
Wait until then to resend pending discovers. In particular, do NOT send
them when the node has just failed (from handle_mds_failure), or we will
crash.
Sage Weil [Sat, 15 Jan 2011 00:57:38 +0000 (16:57 -0800)]
osd: drop messages from before we moved back to boot state
We want to make sure we ignore any messages sent to us before we moved
back to the boot state (after being wrongly marked down). This is only
a problem currently while we are in the BOOT state and waiting to be
re-added to the map, because we may then call _share_map_incoming and
send something on the new rebound messenger to an old peer. Also assert
that we are !booting there to be sure.
Greg Farnum [Sat, 15 Jan 2011 00:22:11 +0000 (16:22 -0800)]
MDS: Use new C_Gather::get_num_remaining() in MDCache.
It was using get_num(), which now reports the number created.
This probably wouldn't have worked previously except that the
C_GatherSub destructor was inappropriately calling rm_sub().
Greg Farnum [Sat, 15 Jan 2011 00:11:01 +0000 (16:11 -0800)]
C_Gather: Rewrite for thread safety.
Previously, C_Gather wasn't thread safe at all,
and there was an issue with creating subs while some
subs were being finished.
These issues are now fixed.
Sage Weil [Fri, 14 Jan 2011 06:08:40 +0000 (22:08 -0800)]
mds: tolerate (with warning) replayed op with bad prealloc_inos
This comes up when an ESession close is followed by an EMetaBlob that
uses a prealloc_ino. That isn't supposed to happen (it's probably a corner
case with session timeout vs a request waiting on locks that didn't
get killed/canceled?). But tolerate it during replay just the same.
Samuel Just [Thu, 13 Jan 2011 19:15:15 +0000 (11:15 -0800)]
ReplicatedPG: snap_trimmer skip removed snaps without collections
If no writes are made between two snapshots, the first won't get a snap
collection. Subsequently removing that snap causes a crash in
snap_trimmer since the collection does not exist.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Thu, 13 Jan 2011 19:02:58 +0000 (11:02 -0800)]
PG: added adjust_local_snaps, activate now checks local collections
adjust_local_snaps handles removing local collections contained in
to_check. On activate, pg will now remove local collections contained
in purged_snaps.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Thu, 13 Jan 2011 20:18:17 +0000 (12:18 -0800)]
PG: activate should not enqueue snap_trimmer on a replica
Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.
This should fix #702.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Sage Weil [Thu, 13 Jan 2011 21:14:24 +0000 (13:14 -0800)]
filejournal: rewrite completion handling, fix ordering on full->notfull
Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion-ordering invariant.
This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions. That same logic needs to also apply to new items
submitted during that commit interval. This was broken before, but the
simpler structure fixes it. Fixes #666.
Tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Sage Weil <sage@newdream.net>
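The deferral rule described above can be sketched as a queue drained strictly in submission order; this is a hypothetical illustration, not FileJournal's actual structure:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical sketch of the ordering invariant: completions are queued in
// submission order and only delivered once the commit has reached their
// sequence number, so items submitted during a commit interval are deferred
// to the next commit rather than reordered.
struct CompletionQueue {
  std::deque<uint64_t> pending;  // seqs, strictly in submission order

  void submit(uint64_t seq) { pending.push_back(seq); }

  // Deliver every completion with seq <= committed, in order.
  std::vector<uint64_t> committed_thru(uint64_t committed) {
    std::vector<uint64_t> done;
    while (!pending.empty() && pending.front() <= committed) {
      done.push_back(pending.front());
      pending.pop_front();
    }
    return done;
  }
};
```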
Samuel Just [Wed, 12 Jan 2011 23:09:51 +0000 (15:09 -0800)]
ReplicatedPG: Fix oi.size bug in _rollback_to
_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0. _rollback_to now resets
the size to match the rolled back object. Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 21:51:55 +0000 (13:51 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup
Previously, get_object_context and get_snapset_context did not register
the resulting objects. In some cases, these objects would not get
registered and multiple copies would end up being created. This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 20:07:44 +0000 (12:07 -0800)]
ReplicatedPG: snap_trimmer workaround
Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps. This workaround should let kvmtest
come back up. A real fix is still needed.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Mon, 10 Jan 2011 22:45:06 +0000 (14:45 -0800)]
ReplicatedPG: Fix bug in rollback
Previously, _rollback_to assumed that the rollback was a no-op if
ctx->clone_obc was set and its prior version matched head's version.
However, this broke in sequences like:
Write "snap1 contents" to oid "blah"
create snapshot "snap1"
Write "snap2 contents" to oid "blah"
create snapshot "snap2"
rollback oid "blah" to snapshot "snap1"
In this case, make_writeable would have just cloned head to the snap2
clone, but the relevant clone is actually "snap1". _rollback_to now
verifies that the most recent clone is the correct one before assuming
that head is already correct.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Thu, 6 Jan 2011 23:48:13 +0000 (15:48 -0800)]
ReplicatedPG: clone_overlap should contain one entry per clone
Previously, writefull and _delete_head would remove the last
entry from snapset.clone_overlap. Now, the last entry becomes
an empty interval_set. clone_overlap should contain one entry
per clone.
The missing entries previously caused a bug in _rollback_to where
iter would be clone_overlap.end().
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
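The one-entry-per-clone invariant can be sketched as follows, with a hypothetical stand-in for Ceph's interval_set type:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

using snapid_t = unsigned long long;
// Hypothetical stand-in for Ceph's interval_set: (offset, length) ranges.
using interval_set_t = std::set<std::pair<unsigned long long, unsigned long long>>;

// Sketch of the fix: instead of erasing the newest clone's overlap entry,
// empty it, so clone_overlap keeps one entry per clone and iteration in
// _rollback_to can never land on clone_overlap.end().
static void clear_newest_overlap(std::map<snapid_t, interval_set_t> &clone_overlap) {
  if (!clone_overlap.empty())
    clone_overlap.rbegin()->second.clear();  // keep the key, drop the ranges
}
```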
Sage Weil [Fri, 24 Dec 2010 16:36:05 +0000 (08:36 -0800)]
osd: generate backlog if needed to get last_complete >= log.tail || backlog
If primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering. This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!
Sage Weil [Mon, 3 Jan 2011 22:32:48 +0000 (14:32 -0800)]
mds: load root inode on replay if auth
If we are auth for the root inode, load its initial value off disk. We
may not see it in the log if it has not been modified. If it has, this
is useless but fast/harmless. This only occurs for brand-new filesystems
where the mds is immediately restarted.
It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Mon, 20 Dec 2010 21:22:49 +0000 (13:22 -0800)]
osd: compensate for replicas with tail > last_complete
Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590). In that case, the primary
can compensate by sending more log info to the replica.
Sage Weil [Sat, 18 Dec 2010 05:02:58 +0000 (21:02 -0800)]
mds: make nested scatterlock state change check more robust
The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start). (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)
We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() case
(which may call scatter_start).