git.apps.os.sepia.ceph.com Git - ceph.git/log
14 years agov0.24.3 v0.24.3
Sage Weil [Thu, 10 Feb 2011 17:14:34 +0000 (09:14 -0800)]
v0.24.3

14 years agomake:add messages/MOSDRepScrub.h to NOINST_HEADERS
Colin Patrick McCabe [Tue, 8 Feb 2011 17:38:50 +0000 (09:38 -0800)]
make:add messages/MOSDRepScrub.h to NOINST_HEADERS

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agoMerge remote branch 'origin/rep_scrub_wq' into stable
Sage Weil [Wed, 9 Feb 2011 00:22:01 +0000 (16:22 -0800)]
Merge remote branch 'origin/rep_scrub_wq' into stable

14 years agoosd: discard scrub reply if pg changed
Sage Weil [Tue, 8 Feb 2011 16:41:52 +0000 (08:41 -0800)]
osd: discard scrub reply if pg changed

build_scrub_map will bail out if the pg changed.  Discard the result in
that case since the primary will ignore it anyway.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoosd: avoid map_lock for scrub_map reply
Sage Weil [Tue, 8 Feb 2011 16:41:14 +0000 (08:41 -0800)]
osd: avoid map_lock for scrub_map reply

Using osd->osdmap->epoch without map_lock is dangerous.  We can avoid it
entirely by replying on the same connection as the request.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoosd: never rewrite log after {advance,activate}_map
Sage Weil [Tue, 8 Feb 2011 16:21:31 +0000 (08:21 -0800)]
osd: never rewrite log after {advance,activate}_map

pg->dirty_log is never true, so this is dead code.  And nothing in either
of those two methods updates the pg log.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoosd: always write backlog after creation
Sage Weil [Tue, 8 Feb 2011 16:20:19 +0000 (08:20 -0800)]
osd: always write backlog after creation

dirty_log is never set to true, so we would set the log.backlog flag but
not write it to disk.  If we restarted the OSD, we would think we had the
backlog in the log but in reality we would not.  clean_up_local() could
then erase almost every object in the PG.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoosd: fix no missing inference
Sage Weil [Tue, 8 Feb 2011 15:54:17 +0000 (07:54 -0800)]
osd: fix no missing inference

Add missing continue in last_update==last_complete (no missing) case.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoPG: remove sub_op_scrub
Samuel Just [Sat, 5 Feb 2011 01:23:15 +0000 (17:23 -0800)]
PG: remove sub_op_scrub

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoPG: switch _request_scrub_map to send MOSDRepScrub
Samuel Just [Fri, 4 Feb 2011 01:02:17 +0000 (17:02 -0800)]
PG: switch _request_scrub_map to send MOSDRepScrub

Also switches sub_op_scrub_reply to sub_op_scrub_map to handle the
OSD_OP_SCRUB_MAP response.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoOSD: Adds handler for MOSDRepScrub
Samuel Just [Sat, 5 Feb 2011 01:00:58 +0000 (17:00 -0800)]
OSD: Adds handler for MOSDRepScrub

handle_rep_scrub enqueues the message in rep_scrub_wq.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoPG: added replica_scrub
Samuel Just [Sat, 5 Feb 2011 00:59:58 +0000 (16:59 -0800)]
PG: added replica_scrub

Adds handler in PG for MOSDRepScrub messages.  replica_scrub will
replace sub_op_scrub.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoOSD: Add rep_scrub_wq
Samuel Just [Sat, 5 Feb 2011 00:58:06 +0000 (16:58 -0800)]
OSD: Add rep_scrub_wq

Previously, replica scrubs would be handled in sub_op_scrub in the op
queue.  Replica scrubs will now be processed by rep_scrub_wq using the
disk tp.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agorados: Adds CEPH_OSD_OP_SCRUB_MAP sub op
Samuel Just [Fri, 4 Feb 2011 23:46:27 +0000 (15:46 -0800)]
rados: Adds CEPH_OSD_OP_SCRUB_MAP sub op

Previously, maps were requested with a sub_op and sent with a
sub_op_reply.  As maps will now be requested using a different message,
replicas will transmit scrub maps requested via MOSDRepScrub messages by
sending a sub_op of type CEPH_OSD_OP_SCRUB_MAP.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoMOSDRepScrub: Adds a message for initiating a replica scrub
Samuel Just [Thu, 3 Feb 2011 22:29:43 +0000 (14:29 -0800)]
MOSDRepScrub: Adds a message for initiating a replica scrub

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agomon: ignore mds boot messages with zeroed port
Sage Weil [Mon, 7 Feb 2011 04:49:59 +0000 (20:49 -0800)]
mon: ignore mds boot messages with zeroed port

On 0.24.2 I saw a zeroed port in the cmds log and in the mdsmap.  Ignore
anything from a cmds with a zeroed port to prevent the insanity from
spreading.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoclient: more carefully guard local cache truncate
Sage Weil [Sun, 6 Feb 2011 21:55:50 +0000 (13:55 -0800)]
client: more carefully guard local cache truncate

This fixes an assert when len=0 in file_to_extents when we get some weird
metadata from the MDS.

Fixes: #778
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agosignal: fix redefine warnings
Sage Weil [Thu, 3 Feb 2011 18:10:51 +0000 (10:10 -0800)]
signal: fix redefine warnings

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoReplicatedPG: snap_trimmer fix leaked lock
Samuel Just [Thu, 3 Feb 2011 19:58:56 +0000 (11:58 -0800)]
ReplicatedPG: snap_trimmer fix leaked lock

Previous patch 7a02070b741d3482ff6b28827c1eb274a2134486 leaks the pg
lock.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoReplicatedPG:snap_trimmer should return if !clean or !active or !primary
Samuel Just [Thu, 3 Feb 2011 18:31:47 +0000 (10:31 -0800)]
ReplicatedPG:snap_trimmer should return if !clean or !active or !primary

The PG may become !clean or !active while in the osd snap_trim_wq.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agomount.ceph: option parsing fix
Samuel Just [Wed, 2 Feb 2011 21:35:21 +0000 (13:35 -0800)]
mount.ceph: option parsing fix

Passing -o secretfile would cause a segfault since searching for '=' would
result in a null pointer.  The new version checks for that case.  Also, *end
cannot be a ','.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
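Illustration only, not the mount.ceph source: a minimal sketch of the null check the message describes, assuming each option is split on '=' with strchr().  The helper name split_option is hypothetical.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: split one mount option of the form "key=value"
// or bare "key".  Returns true only when a value is present.
bool split_option(const char *opt, std::string &key, std::string &value) {
  const char *eq = std::strchr(opt, '=');
  if (eq == nullptr) {
    // Bare option such as "secretfile": there is no '=' to dereference.
    key = opt;
    value.clear();
    return false;
  }
  key.assign(opt, eq - opt);  // characters before '='
  value = eq + 1;             // characters after '='
  return true;
}
```

A caller can treat a false return as "option given without a value" instead of reading through a null pointer.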
14 years agoFileStore: fix double close
Samuel Just [Tue, 1 Feb 2011 21:48:39 +0000 (13:48 -0800)]
FileStore: fix double close

curr_fd is already closed if cp == cur_seq.  This second close
occasionally ended up closing another thread's fd.  The next open would
tend to grab that fd in op_fd or current_fd which would then get closed
by the other thread leaving op_fd or current_fd pointing to some random
file (or a closed descriptor).

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
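Illustration only, not the FileStore code: one common way to make a second close harmless is to invalidate the descriptor variable as soon as it is closed.  The helper name close_once is hypothetical, and the sketch assumes a POSIX system.

```cpp
#include <unistd.h>

// Hypothetical helper: close a descriptor at most once.
// Returns true if this call actually closed it.
bool close_once(int &fd) {
  if (fd < 0)
    return false;  // already closed; the number may now belong to another thread
  ::close(fd);
  fd = -1;         // mark as closed so a later call becomes a no-op
  return true;
}
```

With this shape, a repeated close cannot hit a descriptor number that another thread has since reused.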
14 years agoOSD: update_osd_stat take heartbeat_lock
Samuel Just [Fri, 28 Jan 2011 22:07:47 +0000 (14:07 -0800)]
OSD: update_osd_stat take heartbeat_lock

Previously update_osd_stat had a race with code modifying heartbeat_from
causing the iterator increment to occasionally segfault.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
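Illustration only, not the OSD code: a minimal sketch of iterating a shared map under the same lock its writers take.  HeartbeatPeers and its members are hypothetical stand-ins for heartbeat_lock and heartbeat_from, and std::mutex is assumed in place of Ceph's own Mutex class.

```cpp
#include <map>
#include <mutex>
#include <vector>

struct HeartbeatPeers {
  std::mutex lock;          // stands in for heartbeat_lock
  std::map<int, int> from;  // stands in for heartbeat_from: osd id -> epoch

  void add(int osd, int epoch) {
    std::lock_guard<std::mutex> l(lock);
    from[osd] = epoch;
  }

  void remove(int osd) {
    std::lock_guard<std::mutex> l(lock);
    from.erase(osd);
  }

  // Reader: holds the lock for the whole walk, so no writer can
  // invalidate the iterator between increments.
  std::vector<int> peer_ids() {
    std::lock_guard<std::mutex> l(lock);
    std::vector<int> ids;
    for (auto p = from.begin(); p != from.end(); ++p)
      ids.push_back(p->first);
    return ids;
  }
};
```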
14 years agoLocker: Drop loner correctly!
Greg Farnum [Sat, 29 Jan 2011 00:44:50 +0000 (16:44 -0800)]
Locker: Drop loner correctly!

Our previous check for whether we want to drop the loner was incorrect.
It is now fixed.  Resolves a serious bug with inode write access.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
14 years agomds: defer sending resolves until mdsmap.failed.empty()
Sage Weil [Fri, 28 Jan 2011 20:35:38 +0000 (12:35 -0800)]
mds: defer sending resolves until mdsmap.failed.empty()

There is no point sending resolves while there are still failed nodes,
since we can't complete.  We also trigger an assert if we try to send to
a failed node.  Instead just wait until failed.empty() and then start.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
14 years agoosd: fix mutual exclusion for _dispatch
Sage Weil [Fri, 28 Jan 2011 09:24:49 +0000 (01:24 -0800)]
osd: fix mutual exclusion for _dispatch

We want only one thread dispatching messages (either new or requeued), so
that we can preserve ordering.  Previously we weren't doing so for all
callers of do_waiters (tick() and the first in ms_dispatch()).

This fixes osd_sub_op(_reply) ordering problems that trigger the
now-famous repop queue assert.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: preserve ordering when ops are requeued
Sage Weil [Wed, 26 Jan 2011 18:06:49 +0000 (10:06 -0800)]
osd: preserve ordering when ops are requeued

Requeue ops under osd_lock to preserve ordering wrt incoming messages.
Also drain the waiter queue when ms_dispatch takes the lock before calling
_dispatch(m).

Fixes: #743
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: restart if the osdmap client, heartbeat, OR cluster addrs don't match
Sage Weil [Tue, 25 Jan 2011 23:28:49 +0000 (15:28 -0800)]
osd: restart if the osdmap client, heartbeat, OR cluster addrs don't match

If we somehow get ourselves into a situation where the OSDMap addresses do
not match our actual addresses, restart and try again.  This is still
possible if multiple MOSDBoot messages end up in flight in the monitor,
say due to a monitor disconnect/reconnect, and we race with something that
marks us down in the map.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: avoid extraneous send_boot() calls
Sage Weil [Tue, 25 Jan 2011 23:04:06 +0000 (15:04 -0800)]
osd: avoid extraneous send_boot() calls

Only send_boot() on osdmap update if we are restarting.  Otherwise we can
end up with too many MOSDBoot messages in flight and the monitor may
apply an old one instead of a new one.  For example:

- cosd starts
- send_boot with address set A
- get an osdmap update
- send_boot again with address set A
- get an osdmap update.  now we're up.
- get an osdmap update.  now we're marked down.
- bind to address set B
- send_boot with address set B

and the monitor may apply the second MOSDBoot (with address set A).

This results in an online OSD using a cluster address that differs from
that in the OSDMap.  Which causes problems with peering, among other
things.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG: _rollback_to fix the just cloned condition
Samuel Just [Tue, 25 Jan 2011 21:58:36 +0000 (13:58 -0800)]
ReplicatedPG: _rollback_to fix the just cloned condition

_rollback_to in the case that head was just cloned and that clone
includes snapid does not need to do anything.  Previously, snapid would
have to match the snap on the clone, but the condition should be that
snapid is contained within the clone's snaps set.

This bug was introduced in e189222f06ee287eeb6fd7f46cff7a6727806dea

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
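Illustration only, not the ReplicatedPG code: a minimal sketch of the corrected condition, assuming a clone's snaps are held in a plain vector.  The function name clone_covers_snap is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical check: the rollback can be skipped when the freshly made
// clone covers snapid, i.e. snapid appears anywhere in the clone's snaps,
// not only when it equals one particular snap.
bool clone_covers_snap(const std::vector<uint64_t> &clone_snaps,
                       uint64_t snapid) {
  return std::find(clone_snaps.begin(), clone_snaps.end(), snapid) !=
         clone_snaps.end();
}
```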
14 years agov0.24.2 v0.24.2
Sage Weil [Mon, 24 Jan 2011 20:53:22 +0000 (12:53 -0800)]
v0.24.2

14 years agomsgr: make connection pipe reset atomic
Sage Weil [Mon, 24 Jan 2011 18:59:21 +0000 (10:59 -0800)]
msgr: make connection pipe reset atomic

Close a small and unlikely race.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomsgr: include con in debug output
Sage Weil [Mon, 24 Jan 2011 18:58:42 +0000 (10:58 -0800)]
msgr: include con in debug output

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: don't wait min sync interval on explicit sync()
Sage Weil [Fri, 21 Jan 2011 18:43:53 +0000 (10:43 -0800)]
filestore: don't wait min sync interval on explicit sync()

Also, if we do wait longer, wait on the same cond.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG: fix snap_trimmer log version bug
Samuel Just [Fri, 21 Jan 2011 21:01:20 +0000 (13:01 -0800)]
ReplicatedPG: fix snap_trimmer log version bug

Previously, ctx->at_version would be the same as ctx->obs->oi.version
leading to the log entry having prior_version == version.
This bug was introduced in d1b85e06fb5ce1cfd5bbc74ba639811b92033909.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoFileJournal: don't overflow the journal size.
Greg Farnum [Fri, 21 Jan 2011 22:20:16 +0000 (14:20 -0800)]
FileJournal: don't overflow the journal size.

Previously we were casting it to a uint64_t, but the left shift
occurs before the cast, so we were overflowing in some circumstances.
Split these up to prevent it.

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
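Illustration only, not the FileJournal code: a minimal sketch of why casting after the shift is too late.  The function names are hypothetical, and the sketch uses an unsigned 32-bit input (with 32-bit unsigned int, as on typical platforms) so that the wraparound is well defined; with a signed int the overflow would be undefined behaviour.

```cpp
#include <cstdint>

// Buggy shape: the shift is evaluated in 32-bit arithmetic, and only the
// already-wrapped result is widened to 64 bits.
uint64_t size_shift_then_cast(uint32_t megabytes) {
  return static_cast<uint64_t>(megabytes << 20);
}

// Fixed shape: widen first, then shift in 64-bit arithmetic.
uint64_t size_cast_then_shift(uint32_t megabytes) {
  uint64_t bytes = megabytes;
  bytes <<= 20;
  return bytes;
}
```

For 8192 megabytes the first form wraps to 0, while the second yields 8589934592 bytes.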
14 years agomds: respawn must unblock signals before exec
Colin Patrick McCabe [Fri, 21 Jan 2011 14:45:40 +0000 (06:45 -0800)]
mds: respawn must unblock signals before exec

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agocommon: move signal blocking into signal.cc
Colin Patrick McCabe [Fri, 21 Jan 2011 14:27:55 +0000 (06:27 -0800)]
common: move signal blocking into signal.cc

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agocommon: add signal_mask_to_str
Colin Patrick McCabe [Fri, 21 Jan 2011 13:45:01 +0000 (05:45 -0800)]
common: add signal_mask_to_str

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agomsgr: always start reaper
Sage Weil [Fri, 21 Jan 2011 18:08:26 +0000 (10:08 -0800)]
msgr: always start reaper

If we didn't explicitly bind (i.e. are a client), then we don't start
the accepter.  That's fine. But the reaper thread start was also
conditional, when it shouldn't be; otherwise the client can't clean up
old Pipes (and their sockets).

Fixes: #732
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomonclient: fix locking
Sage Weil [Fri, 21 Jan 2011 17:35:31 +0000 (09:35 -0800)]
monclient: fix locking

Hold lock in handle_* methods; assert lock held in all _* methods.

Fixes: #731
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agosignals: signal.cc: trim includes
Colin Patrick McCabe [Thu, 20 Jan 2011 11:34:09 +0000 (03:34 -0800)]
signals: signal.cc: trim includes

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agocommon: re-install sighandlers after daemon()
Colin Patrick McCabe [Wed, 19 Jan 2011 17:24:53 +0000 (09:24 -0800)]
common: re-install sighandlers after daemon()

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agocommon: move signal handler stuff into signal.cc
Colin Patrick McCabe [Wed, 19 Jan 2011 17:15:02 +0000 (09:15 -0800)]
common: move signal handler stuff into signal.cc

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years agoReplicatedPG.cc: fix snap_trimmer object context error
Samuel Just [Thu, 20 Jan 2011 00:47:57 +0000 (16:47 -0800)]
ReplicatedPG.cc: fix snap_trimmer object context error

Previously, snap_trimmer would get the clone object information from the
object store rather than using find_object_context.  This would cause
the cached version to not be updated with the new version in the case
that the object information got updated.  As a result, the need field of
the missing object could get a stale version inconsistent with the most
recent logged version.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoReplicatedPG.cc: update coi version and prior_version to match log
Samuel Just [Wed, 19 Jan 2011 22:02:27 +0000 (14:02 -0800)]
ReplicatedPG.cc: update coi version and prior_version to match log

Caused error where oi on clone would not get updated version when snaps
was updated.  oi.version would lag behind the missing item's need field
during recovery.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoReplicatedPG.cc: fix use of potentially invalid pointer
Samuel Just [Wed, 19 Jan 2011 20:06:17 +0000 (12:06 -0800)]
ReplicatedPG.cc: fix use of potentially invalid pointer

rollback_to may not be initialized if ret != 0.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoReplicatedPG,PG,OSD: snap_trimmer should run only when the PG is clean
Samuel Just [Thu, 20 Jan 2011 00:47:32 +0000 (16:47 -0800)]
ReplicatedPG,PG,OSD: snap_trimmer should run only when the PG is clean

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agosignals: handle_fatal_signal: use SA_NODEFER
Colin Patrick McCabe [Tue, 28 Dec 2010 02:04:17 +0000 (18:04 -0800)]
signals: handle_fatal_signal: use SA_NODEFER

SA_RESETHAND | SA_NODEFER allows the "re-trigger default signal handler"
trick to work for signals other than SIGSEGV.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
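Illustration only, not the Ceph signal code: a minimal sketch of installing a handler with SA_RESETHAND | SA_NODEFER on a POSIX system.  The names fatal_handler_sketch and install_fatal_handler are hypothetical, and re-raising from inside the handler is an assumption about how the "re-trigger" step might be done.

```cpp
#include <csignal>
#include <cstring>
#include <signal.h>

// Hypothetical handler.  A real one would log a backtrace first.
extern "C" void fatal_handler_sketch(int signum) {
  // SA_RESETHAND has already restored the default disposition, and
  // SA_NODEFER leaves the signal unblocked inside the handler, so this
  // raise() is delivered immediately with the default action.
  raise(signum);
}

// Install the handler for one signal; returns sigaction()'s result.
int install_fatal_handler(int signum) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = fatal_handler_sketch;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND | SA_NODEFER;
  return sigaction(signum, &sa, nullptr);
}
```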
14 years agosignals: backtrace some more exotic fatal signals
Colin Patrick McCabe [Tue, 28 Dec 2010 01:49:57 +0000 (17:49 -0800)]
signals: backtrace some more exotic fatal signals

We're not likely to see these, but if we do, we want it in the logs!

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agosignals: Handle SIGILL, SIGBUS, SIGFPE.
Colin Patrick McCabe [Tue, 28 Dec 2010 01:29:57 +0000 (17:29 -0800)]
signals: Handle SIGILL, SIGBUS, SIGFPE.

Print out a backtrace when we get SIGILL, SIGBUS, or SIGFPE. Fix a bug
where we failed to install a SIGABRT handler.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomds: fix journaling of root default_file_layout
Sage Weil [Wed, 19 Jan 2011 17:50:41 +0000 (09:50 -0800)]
mds: fix journaling of root default_file_layout

We need to include the default_file_layout (if any) on root inodes, too.

Fixes: #725
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: remove rank from failed when taking over for failed node
Sage Weil [Tue, 18 Jan 2011 23:13:07 +0000 (15:13 -0800)]
mon: remove rank from failed when taking over for failed node

Leaving it there leaves a broken MDSMap, and prevents rejoin because
MDSMap::is_rejoining() is always false.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: kick discovers when peers enter active|clientreplay|rejoin
Sage Weil [Tue, 18 Jan 2011 23:09:51 +0000 (15:09 -0800)]
mds: kick discovers when peers enter active|clientreplay|rejoin

We process discovers when active, clientreplay, or later stages of rejoin.
Wait until then to resend pending discovers.  In particular, do NOT send
them when the node has just failed (from handle_mds_failure), or we will
crash.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: fix 'ceph mds fail <N>' command
Sage Weil [Tue, 18 Jan 2011 21:27:10 +0000 (13:27 -0800)]
mon: fix 'ceph mds fail <N>' command

We need to remove the mds_info from the map for cmds to take notice.

Fixes: #720
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoPG: fix adjust_local_snaps bug
Samuel Just [Tue, 18 Jan 2011 21:16:34 +0000 (13:16 -0800)]
PG: fix adjust_local_snaps bug

current must be removed from to_remove in the loop for the loop to
terminate (and not cause a double erasure from snap_collections)!

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
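Illustration only, not the PG code: a minimal sketch of a drain loop that terminates because each processed element is erased from the work set.  It assumes plain std::set<int> in place of interval_set, and the function name remove_local_snaps is hypothetical.

```cpp
#include <set>

// Removes every snap listed in to_remove from snap_collections and
// returns how many were actually present.
int remove_local_snaps(std::set<int> &snap_collections,
                       std::set<int> to_remove) {
  int removed = 0;
  while (!to_remove.empty()) {
    int current = *to_remove.begin();
    removed += static_cast<int>(snap_collections.erase(current));
    // Without this erase, begin() keeps returning the same element:
    // the loop never ends and current is processed again and again.
    to_remove.erase(current);
  }
  return removed;
}
```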
14 years agoMerge branch 'purged_snaps' into testing
Sage Weil [Tue, 18 Jan 2011 15:37:58 +0000 (07:37 -0800)]
Merge branch 'purged_snaps' into testing

14 years agoosd: rebind heartbeat_messenger (with cluster one) when wrongly marked down
Sage Weil [Sat, 15 Jan 2011 00:58:47 +0000 (16:58 -0800)]
osd: rebind heartbeat_messenger (with cluster one) when wrongly marked down

This keeps things clean.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomessenger: let rebind() avoid multiple ports
Sage Weil [Sat, 15 Jan 2011 00:58:19 +0000 (16:58 -0800)]
messenger: let rebind() avoid multiple ports

We need to rebind two messengers, which means avoiding both old ports.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: drop messages from before we moved back to boot state
Sage Weil [Sat, 15 Jan 2011 00:57:38 +0000 (16:57 -0800)]
osd: drop messages from before we moved back to boot state

We want to make sure we ignore any messages sent to us before we moved
back to the boot state (after being wrongly marked down).  This is only
a problem currently while we are in the BOOT state and waiting to be
re-added to the map, because we may then call _share_map_incoming and
send something on the new rebound messenger to an old peer.  Also assert
that we are !booting there to be sure.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoMDS: Use new C_Gather::get_num_remaining() in MDCache.
Greg Farnum [Sat, 15 Jan 2011 00:22:11 +0000 (16:22 -0800)]
MDS: Use new C_Gather::get_num_remaining() in MDCache.

It was using get_num(), which now reports the number created.
This probably wouldn't have worked previously except that
~C_Gather::C_GatherSub was inappropriately calling rm_sub().

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
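Illustration only, not the C_Gather class: a minimal sketch of the difference between "number created" and "number remaining".  The struct GatherCounts is hypothetical and omits locking; the counter names follow the sub_[created|existing]_count naming mentioned in the C_Gather commits.

```cpp
// Hypothetical, single-threaded stand-in for the two counters.
struct GatherCounts {
  int sub_created_count = 0;   // total subs ever created
  int sub_existing_count = 0;  // subs that have not finished yet

  void new_sub() {
    ++sub_created_count;
    ++sub_existing_count;
  }

  void sub_finish() { --sub_existing_count; }

  int get_num() const { return sub_created_count; }
  int get_num_remaining() const { return sub_existing_count; }
};
```

A caller that wants to know whether work is still outstanding needs get_num_remaining(); get_num() never decreases.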
14 years agoC_Gather: Set debug #ifdefs to remove set.
Greg Farnum [Sat, 15 Jan 2011 00:12:32 +0000 (16:12 -0800)]
C_Gather: Set debug #ifdefs to remove set.

This way when we're confident it works right, we can
remove the set<Context*> and just rely on ref counting.

Further optimizations would include using a spinlock
rather than a mutex, or possibly even just switching
sub_[created|existing]_count to be atomics.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoC_Gather: Rewrite for thread safety.
Greg Farnum [Sat, 15 Jan 2011 00:11:01 +0000 (16:11 -0800)]
C_Gather: Rewrite for thread safety.

Previously, C_Gather wasn't thread safe at all,
and there was an issue with creating subs while some
subs were being finished.
These issues are now fixed.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomds: call MonClient::shutdown when doing a journal dump.
Greg Farnum [Wed, 12 Jan 2011 22:46:30 +0000 (14:46 -0800)]
mds: call MonClient::shutdown when doing a journal dump.

Previously we got a failed assert since nothing was calling this.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoos: don't crash on no-journal case
Colin Patrick McCabe [Sun, 9 Jan 2011 21:34:40 +0000 (13:34 -0800)]
os: don't crash on no-journal case

JournalingObjectStore::commit_start should handle the case where journal is
null. This will occur if the user doesn't configure a journal.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomds: tolerate (with warning) replayed op with bad prealloc_inos
Sage Weil [Fri, 14 Jan 2011 06:08:40 +0000 (22:08 -0800)]
mds: tolerate (with warning) replayed op with bad prealloc_inos

This comes up when an ESession close is followed by an EMetaBlob that
uses a prealloc_ino.  That isn't supposed to happen (it's probably a corner
case with session timeout vs a request waiting on locks that didn't
get killed/canceled?).  But tolerate it during replay just the same.

Works around #708.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: improve debug output on ESession journal replay
Sage Weil [Fri, 14 Jan 2011 05:51:05 +0000 (21:51 -0800)]
mds: improve debug output on ESession journal replay

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoOSD,ReplicatedPG: Do not run snap_trimmer while the pg is degraded
Samuel Just [Fri, 14 Jan 2011 00:18:40 +0000 (16:18 -0800)]
OSD,ReplicatedPG: Do not run snap_trimmer while the pg is degraded

snap_trimmer causes replica crashes if the replica is missing
objects.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: snap_trimmer skip removed snaps without collections
Samuel Just [Thu, 13 Jan 2011 19:15:15 +0000 (11:15 -0800)]
ReplicatedPG: snap_trimmer skip removed snaps without collections

If no writes are made between two snapshots, the first won't get a snap
collection.  Subsequently removing that snap causes a crash in
snap_trimmer since the collection does not exist.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoOSD: _pg_process_info refactor to use adjust_local_snaps
Samuel Just [Thu, 13 Jan 2011 19:10:31 +0000 (11:10 -0800)]
OSD: _pg_process_info refactor to use adjust_local_snaps

Changes _pg_process_info to use adjust_local_snaps.  Also accounts for
the incoming info not being a superset of the existing info.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoPG: added adjust_local_snaps, activate now checks local collections
Samuel Just [Thu, 13 Jan 2011 19:02:58 +0000 (11:02 -0800)]
PG: added adjust_local_snaps, activate now checks local collections

adjust_local_snaps handles removing local collections contained in
to_check.  On activate, pg will now remove local collections contained
in purged_snaps.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoPG: change snap_collections to an interval_set
Samuel Just [Tue, 21 Dec 2010 20:01:41 +0000 (12:01 -0800)]
PG: change snap_collections to an interval_set

Previously, the set of local snap collections was represented using a
set, which complicates set operations with interval_sets.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoPG: activate should not enqueue snap_trimmer on a replica
Samuel Just [Thu, 13 Jan 2011 20:18:17 +0000 (12:18 -0800)]
PG: activate should not enqueue snap_trimmer on a replica

Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.

This should fix #702.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agofilejournal: rewrite completion handling, fix ordering on full->notfull
Sage Weil [Thu, 13 Jan 2011 21:14:24 +0000 (13:14 -0800)]
filejournal: rewrite completion handling, fix ordering on full->notfull

Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion ordering invariant.

This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions.  That same logic needs to also apply to new items
submitted during that commit interval.  This was broken before, but the
simpler structure fixes it.  Fixes #666.

Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG: Fix oi.size bug in _rollback_to
Samuel Just [Wed, 12 Jan 2011 23:09:51 +0000 (15:09 -0800)]
ReplicatedPG: Fix oi.size bug in _rollback_to

_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0.  _rollback_to now resets
the size to match the rolled back object.  Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: register_object_context and register_snapset_context cleanup
Samuel Just [Wed, 12 Jan 2011 21:51:55 +0000 (13:51 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup

Previously, get_object_context and get_snapset_context did not register
the resulting objects.  In some cases, these objects would not get
registered and multiple copies would end up created.  This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: snap_trimmer work around
Samuel Just [Wed, 12 Jan 2011 20:07:44 +0000 (12:07 -0800)]
ReplicatedPG: snap_trimmer work around

Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps.  This work around should let kvmtest
come back up.  A real fix is still needed.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoosd: OSD::queue_pg_for_deletion: avoid double del
Colin Patrick McCabe [Tue, 11 Jan 2011 18:15:02 +0000 (10:15 -0800)]
osd: OSD::queue_pg_for_deletion: avoid double del

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomds: avoid double-pinning stray inodes
Sage Weil [Tue, 11 Jan 2011 17:50:20 +0000 (09:50 -0800)]
mds: avoid double-pinning stray inodes

We make multiple iterations through populate_mydir().  Only pin each stray
once.  Fixes #689 and crashes like

mds/CInode.h: In function 'virtual void CInode::bad_get(int)':
mds/CInode.h:1088: FAILED assert(ref_set.count(by) == 0)
ceph version 0.24 (180a4176035521940390f4ce24ee3eb7aa290632)
1: (CInode::bad_put(int)+0) [0x827b090]
2: (MDSCacheObject::get(int)+0x153) [0x813e463]
3: (MDCache::populate_mydir()+0x8a) [0x81a7e5a]
4: (MDCache::_create_system_file_finish(Mutation*, CDentry*,
Context*)+0x181) [0x819f501]
5: (C_MDC_CreateSystemFile::finish(int)+0x29) [0x81d6c29]
6: (finish_contexts(std::list<Context*, std::allocator<Context*> >&,
int)+0x6b) [0x81d663b]
7: (Journaler::_finish_flush(int, long long, utime_t, bool)+0x983) [0x82f2f53]
8: (Journaler::C_Flush::finish(int)+0x3f) [0x82fb24f]
9: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x801) [0x82d8e31]
10: (MDS::_dispatch(Message*)+0x2ae5) [0x80eaa15]
11: (MDS::ms_dispatch(Message*)+0x62) [0x80eb142]
12: (SimpleMessenger::dispatch_entry()+0x899) [0x80b8649]
13: (SimpleMessenger::DispatchThread::entry()+0x22) [0x80b30f2]

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG: Fix bug in rollback
Samuel Just [Mon, 10 Jan 2011 22:45:06 +0000 (14:45 -0800)]
ReplicatedPG: Fix bug in rollback

Previously, _rollback_to assumed that the rollback was a noop if
ctx->clone_obc was set and its prior version matches head's version.
However, this broke in sequences like:

Write "snap1 contents" to oid "blah"
create snapshot "snap1"
Write "snap2 contents" to oid "blah"
create snapshot "snap2"
rollback oid "blah" to snapshot "snap1"

In this case, make_writeable would have just cloned head to the snap2
clone, but the relevant clone is actually "snap1".  _rollback_to now
verifies that the most recent clone is the correct one before assuming
that head is already correct.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agov0.24.1 v0.24.1
Sage Weil [Sat, 8 Jan 2011 00:50:15 +0000 (16:50 -0800)]
v0.24.1

14 years agoReplicatedPG: get_object_context ssc refcount leak
Samuel Just [Fri, 7 Jan 2011 22:23:04 +0000 (14:23 -0800)]
ReplicatedPG: get_object_context ssc refcount leak

If obc->obs.ssc is non-null, the second get_snapset_context ends up
leaking a snapset reference.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: clone_overlap should contain one entry per clone
Samuel Just [Thu, 6 Jan 2011 23:48:13 +0000 (15:48 -0800)]
ReplicatedPG: clone_overlap should contain one entry per clone

Previously, writefull and _delete_head would remove the last
entry from snapset.clone_overlap.  Now, the last entry becomes
an empty interval_set.  clone_overlap should contain one entry
per clone.

The missing entries previously caused a bug in _rollback_to where
iter would be clone_overlap.end().

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
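Illustration only, not the ReplicatedPG code: a minimal sketch contrasting "remove the last entry" with "make the last entry empty".  It assumes a std::map of std::set in place of the real clone_overlap and interval_set types; the alias and function names are hypothetical.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <set>

// Hypothetical stand-in: clone snap id -> overlapping offsets.
using CloneOverlap = std::map<uint64_t, std::set<uint64_t>>;

// Old behaviour as described: the newest clone loses its entry, so a
// later lookup for that clone lands on end().
void drop_last_overlap(CloneOverlap &overlap) {
  if (!overlap.empty())
    overlap.erase(std::prev(overlap.end()));
}

// New behaviour: the key stays, only its overlap becomes empty, so the
// map keeps exactly one entry per clone.
void clear_last_overlap(CloneOverlap &overlap) {
  if (!overlap.empty())
    overlap.rbegin()->second.clear();
}
```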
14 years agoPG: Fixes bug in _scrub with checking clones
Samuel Just [Tue, 4 Jan 2011 22:30:15 +0000 (14:30 -0800)]
PG: Fixes bug in _scrub with checking clones

I introduced this bug in
4a4a1e53c7d380cd0b582c1d0685fd0ef4ef1711.
curclone++ not curclone--.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoPG: Fix bug in scrub when checking clone sizes
Samuel Just [Tue, 4 Jan 2011 00:48:39 +0000 (16:48 -0800)]
PG: Fix bug in scrub when checking clone sizes

Previously, _scrub checked:
assert(p->second.size == snapset.clone_size[curclone])

curclone was, however, an index into snapset.clones rather than a
snapid_t.  For clarity, curclone is now an iterator.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
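Illustration only, not the PG scrub code: a minimal sketch of looking up a clone's size by the snap id an iterator points at, rather than by its position in the clones list.  SnapSetSketch and size_of_clone are hypothetical, and uint64_t stands in for snapid_t.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct SnapSetSketch {
  std::vector<uint64_t> clones;             // clone snap ids, in order
  std::map<uint64_t, uint64_t> clone_size;  // keyed by snap id, not position
};

// Dereference the iterator to get the snap id, then use that as the key.
uint64_t size_of_clone(const SnapSetSketch &ss,
                       std::vector<uint64_t>::const_iterator curclone) {
  return ss.clone_size.at(*curclone);
}
```

With clones {4, 9}, position 1 refers to snap id 9; using the position itself as a key would look up a snap id that does not exist.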
14 years agomds: assert no submit_entry during replay state
Sage Weil [Fri, 24 Dec 2010 17:00:28 +0000 (09:00 -0800)]
mds: assert no submit_entry during replay state

We should never submit items to the journal during replay.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: start new log segment at resolve start, not replay finish
Sage Weil [Fri, 24 Dec 2010 17:00:02 +0000 (09:00 -0800)]
mds: start new log segment at resolve start, not replay finish

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: clean up backlog generation checks a bit
Sage Weil [Fri, 24 Dec 2010 16:36:28 +0000 (08:36 -0800)]
osd: clean up backlog generation checks a bit

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: generate backlog if needed to get last_complete >= log.tail || backlog
Sage Weil [Fri, 24 Dec 2010 16:36:05 +0000 (08:36 -0800)]
osd: generate backlog if needed to get last_complete >= log.tail || backlog

If the primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering.  This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: send sufficient log to compensate for replicas with last_complete < log.tail
Sage Weil [Fri, 24 Dec 2010 16:27:38 +0000 (08:27 -0800)]
osd: send sufficient log to compensate for replicas with last_complete < log.tail

If a replica has last_complete < log.tail and no backlog, send enough log
for them to get back into a consistent state.

Signed-off-by: Sage Weil <sage@newdream.net>
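
One way to read the rule above is sketched below. This is not the OSD peering code: ReplicaInfoSketch and send_log_from are hypothetical, plain integers stand in for log versions, and the choice to send from last_complete is an assumption about what "enough log" means.

```cpp
#include <algorithm>
#include <cstdint>

using Version = std::uint64_t;  // stand-in for a pg log version

// Hypothetical summary of what the primary knows about a replica.
struct ReplicaInfoSketch {
  Version last_update;    // newest entry the replica has
  Version last_complete;  // everything up to here is fully applied
  Version log_tail;       // oldest entry still in the replica's log
  bool backlog;           // replica has a generated backlog
};

// Pick the version after which the primary sends log entries.
// Assumption: a mistrimmed replica needs everything after last_complete.
inline Version send_log_from(const ReplicaInfoSketch& r) {
  const bool mistrimmed = r.last_complete < r.log_tail && !r.backlog;
  return mistrimmed ? std::min(r.last_complete, r.last_update)
                    : r.last_update;
}
```

In the normal case only entries newer than last_update are sent. The earlier starting point applies only when last_complete has fallen behind the tail and no backlog exists.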
14 years agomds: load root inode on replay if auth
Sage Weil [Mon, 3 Jan 2011 22:32:48 +0000 (14:32 -0800)]
mds: load root inode on replay if auth

If we are auth for the root inode, load its initial value off of disk. We
may not see it in the log if it has not been modified.  If it has, this
is useless but fast/harmless.  This only occurs for brand-new filesystems
where the mds is immediately restarted.

Fixes #671.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomsgr: Unlock dispatch_queue.lock when short-circuiting queue_received.
Greg Farnum [Mon, 3 Jan 2011 22:14:00 +0000 (14:14 -0800)]
msgr: Unlock dispatch_queue.lock when short-circuiting queue_received.

Previously we left the mutex locked, which is obviously bad bad bad!
I believe this was the cause of #673.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
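
The class of bug described above (an early return that leaves a mutex held) can be sketched like this. It is not the SimpleMessenger code, and it does not claim to be how the fix was made: DispatchQueueSketch and its members are hypothetical, and std::lock_guard is used here only as one common way to make every return path release the lock.

```cpp
#include <deque>
#include <mutex>

// Hypothetical stand-in for a dispatch queue guarded by a mutex.
struct DispatchQueueSketch {
  std::mutex lock;
  std::deque<int> queue;
  bool halted = false;

  // Returns true if the message was queued.
  // The guard's destructor releases the mutex on both return paths,
  // including the short-circuit one.
  bool queue_received(int msg) {
    std::lock_guard<std::mutex> guard(lock);
    if (halted)
      return false;  // short-circuit: the lock is still released here
    queue.push_back(msg);
    return true;
  }
};
```

With manual lock()/unlock() calls, the short-circuit branch needs its own unlock(), which is the step the message says was missing.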
14 years agofilestore: assert on out of order journal pipeline submissions
Sage Weil [Mon, 3 Jan 2011 21:14:49 +0000 (13:14 -0800)]
filestore: assert on out of order journal pipeline submissions

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: fix wake condition when journal submission blocks
Sage Weil [Mon, 3 Jan 2011 21:14:13 +0000 (13:14 -0800)]
filestore: fix wake condition when journal submission blocks

We only want to wake up if we are at the front of the line, in order to
preserve journal submission pipeline ordering.

This fixes, among other things, messages in the log like

2010-12-21 10:38:42.515974 7f0861486700 journal op_submit_finish 5364 expected 5370, OUT OF ORDER

and bug #666.

Signed-off-by: Sage Weil <sage@newdream.net>
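
The "front of the line" wake condition can be illustrated with a single-threaded model. This is not the filestore code: SubmitQueueSketch and its members are hypothetical, and the real code blocks on a condition variable rather than polling a predicate. The sketch only shows the predicate a waiter would check before proceeding.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical model of ops waiting to submit to the journal.
struct SubmitQueueSketch {
  std::deque<std::uint64_t> waiting;  // op sequence numbers, arrival order

  void enqueue(std::uint64_t op) { waiting.push_back(op); }

  // An op may proceed only if the journal has room AND the op is at the
  // front of the line. Checking room alone lets a later op overtake.
  bool may_proceed(std::uint64_t op, bool journal_has_room) const {
    return journal_has_room && !waiting.empty() && waiting.front() == op;
  }

  // Remove the op once it has been submitted; no-op if it is not in front.
  void finish(std::uint64_t op) {
    if (!waiting.empty() && waiting.front() == op)
      waiting.pop_front();
  }
};
```

A waiter that wakes up and finds may_proceed() false goes back to sleep, so submissions leave the queue in the order they arrived.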
14 years agomds: fix purge_stray for directories, zeroed layouts
Sage Weil [Mon, 3 Jan 2011 19:50:53 +0000 (11:50 -0800)]
mds: fix purge_stray for directories, zeroed layouts

- We don't want to purge file content on directories
- Don't fall over if a file has a zero period

Reported-by: Paul Komkoff <i@stingr.net>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: PG::Info::History: init last_epoch_clean
Colin Patrick McCabe [Wed, 29 Dec 2010 01:03:12 +0000 (17:03 -0800)]
osd: PG::Info::History: init last_epoch_clean

It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
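
A minimal sketch of the kind of fix described above, assuming nothing about the real struct beyond the last_epoch_clean field named in the message. HistorySketch, the epoch_t alias and the second field are hypothetical, and default member initializers are just one way to zero the fields.

```cpp
#include <cstdint>

using epoch_t = std::uint32_t;  // assumption: epochs are unsigned integers

// Hypothetical stand-in for PG::Info::History.
struct HistorySketch {
  epoch_t last_epoch_started = 0;  // hypothetical neighbouring field
  epoch_t last_epoch_clean = 0;    // zeroed at construction
};
```

Without an initializer, a default-constructed object of this type would hold an indeterminate value in last_epoch_clean, which fits the odd log output the message mentions.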
14 years agoSimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
Samuel Just [Wed, 1 Dec 2010 00:52:40 +0000 (16:52 -0800)]
SimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
when the pipe has been halted.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agov0.24 v0.24
Sage Weil [Wed, 15 Dec 2010 21:01:57 +0000 (13:01 -0800)]
v0.24

14 years agoosd: compensate for replicas with tail > last_complete
Sage Weil [Mon, 20 Dec 2010 21:22:49 +0000 (13:22 -0800)]
osd: compensate for replicas with tail > last_complete

Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590).  In that case, the primary
can compensate by sending more log info to the replica.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: make nested scatterlock state change check more robust
Sage Weil [Sat, 18 Dec 2010 05:02:58 +0000 (21:02 -0800)]
mds: make nested scatterlock state change check more robust

The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start).  (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)

We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() (which may
call scatter_start) case.

Fixes this crash:

mds/MDLog.h: In function 'void MDLog::start_entry(LogEvent*)':
mds/MDLog.h:191: FAILED assert(cur_event == __null)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (CInode::finish_scatter_update(ScatterLock*, CDir*, unsigned long, unsigned long)+0x804) [0x606e14]
 2: (CInode::start_scatter(ScatterLock*)+0xaa) [0x60dc1a]
 3: (Locker::scatter_mix(ScatterLock*, bool*)+0x1ca) [0x589a9a]
 4: (Locker::wrlock_start(SimpleLock*, MDRequest*, bool)+0x165) [0x597d65]
 5: (MDCache::predirty_journal_parents(Mutation*, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x153e) [0x55a70e]
 6: (Locker::scatter_writebehind(ScatterLock*)+0x42d) [0x58553d]
 7: (Locker::simple_lock(SimpleLock*, bool*)+0x7ab) [0x58beeb]
 8: (Locker::scatter_nudge(ScatterLock*, Context*, bool)+0x3ad) [0x58c49d]
 9: (Locker::scatter_tick()+0x28a) [0x58c98a]
 10: (MDS::tick()+0x4e4) [0x4b26a4]
 11: (SafeTimer::timer_thread()+0x22c) [0x6d164c]
 12: (SafeTimerThread::entry()+0xd) [0x6d34bd]
 13: (Thread::_entry_func(void*)+0xa) [0x4943da]
 14: /lib/libpthread.so.0 [0x7fc87810b73a]
 15: (clone()+0x6d) [0x7fc876dad69d]

Signed-off-by: Sage Weil <sage@newdream.net>
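
The interaction between a single open log entry and nowait can be modelled roughly as below. This is a toy model, not the MDS locker: MDLogSketch, ScatterLockSketch and this wrlock_start signature are hypothetical, and the lock state is reduced to one boolean. It shows only the rule that a nowait caller must never trigger a state change that would open a second journal entry.

```cpp
#include <cassert>

// Hypothetical journal that, like MDLog in the message, allows one open entry.
struct MDLogSketch {
  bool entry_open = false;
  void start_entry() {
    assert(!entry_open);  // mirrors the failed assert in the trace
    entry_open = true;
  }
  void submit_entry() {
    assert(entry_open);
    entry_open = false;
  }
};

// Hypothetical lock whose state change needs to journal something.
struct ScatterLockSketch {
  bool wrlockable = false;
};

// Returns true if the wrlock could be taken.
inline bool wrlock_start(ScatterLockSketch& lock, MDLogSketch& log,
                         bool nowait) {
  if (lock.wrlockable)
    return true;
  if (nowait)
    return false;      // caller has an entry open: refuse to journal again
  log.start_entry();   // the state change journals its own entry
  log.submit_entry();
  lock.wrlockable = true;
  return true;
}
```

A caller that already has an entry open passes nowait=true and gets false back instead of tripping the assert. The bug described above corresponds to a path where that check did not cover one of the state changes.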