git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
14 years ago mds: respawn must unblock signals before exec
Colin Patrick McCabe [Fri, 21 Jan 2011 14:45:40 +0000 (06:45 -0800)]
mds: respawn must unblock signals before exec

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years ago common: move signal blocking into signal.cc
Colin Patrick McCabe [Fri, 21 Jan 2011 14:27:55 +0000 (06:27 -0800)]
common: move signal blocking into signal.cc

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years ago common: add signal_mask_to_str
Colin Patrick McCabe [Fri, 21 Jan 2011 13:45:01 +0000 (05:45 -0800)]
common: add signal_mask_to_str

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years ago msgr: always start reaper
Sage Weil [Fri, 21 Jan 2011 18:08:26 +0000 (10:08 -0800)]
msgr: always start reaper

If we didn't explicitly bind (i.e. are a client), then we don't start
the accepter.  That's fine. But the reaper thread start was also
conditional, when it shouldn't be; otherwise the client can't clean up
old Pipes (and their sockets).

Fixes: #732
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago monclient: fix locking
Sage Weil [Fri, 21 Jan 2011 17:35:31 +0000 (09:35 -0800)]
monclient: fix locking

Hold lock in handle_* methods; assert lock held in all _* methods.

Fixes: #731
Signed-off-by: Sage Weil <sage@newdream.net>
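The locking convention in the monclient commit above (public handle_* entry points take the lock, internal _* helpers only assert that it is held) can be sketched as follows. This is a minimal illustration and not Ceph's code: TrackedMutex, MonClientSketch, and demo_handled_count are hypothetical names, and the sketch assumes std::mutex in place of Ceph's own Mutex class.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical mutex wrapper that remembers whether it is currently held.
// is_locked() only says that *someone* holds the lock, which is enough for
// a sanity assert but is not proof that the caller is the holder.
class TrackedMutex {
  std::mutex m;
  std::atomic<bool> held{false};
public:
  void lock() { m.lock(); held = true; }
  void unlock() { held = false; m.unlock(); }
  bool is_locked() const { return held; }
};

class MonClientSketch {
  TrackedMutex lock;
  int handled = 0;

  // Internal helper: the caller must already hold the lock.
  void _note_message() {
    assert(lock.is_locked());
    ++handled;
  }

public:
  // Public entry point: takes the lock, then calls the helper.
  void handle_message() {
    std::lock_guard<TrackedMutex> guard(lock);
    _note_message();
  }

  int count() {
    std::lock_guard<TrackedMutex> guard(lock);
    return handled;
  }
};

// Small driver: handles two messages and reports the count.
int demo_handled_count() {
  MonClientSketch client;
  client.handle_message();
  client.handle_message();
  return client.count();
}
```

Calling a _* helper without going through a handle_* method would trip the assert in a debug build, which is the point of the convention.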
14 years ago signals: signal.cc: trim includes
Colin Patrick McCabe [Thu, 20 Jan 2011 11:34:09 +0000 (03:34 -0800)]
signals: signal.cc: trim includes

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years ago common: re-install sighandlers after daemon()
Colin Patrick McCabe [Wed, 19 Jan 2011 17:24:53 +0000 (09:24 -0800)]
common: re-install sighandlers after daemon()

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years ago common: move signal handler stuff into signal.cc
Colin Patrick McCabe [Wed, 19 Jan 2011 17:15:02 +0000 (09:15 -0800)]
common: move signal handler stuff into signal.cc

Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
14 years ago ReplicatedPG.cc: fix snap_trimmer object context error
Samuel Just [Thu, 20 Jan 2011 00:47:57 +0000 (16:47 -0800)]
ReplicatedPG.cc: fix snap_trimmer object context error

Previously, snap_trimmer would get the clone object information from the
object store rather than using find_object_context.  This would cause
the cached version to not be updated with the new version in the case
that the object information got updated.  As a result, the need field of
the missing object could get a stale version inconsistent with the most
recent logged version.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years ago ReplicatedPG.cc: update coi version and prior_version to match log
Samuel Just [Wed, 19 Jan 2011 22:02:27 +0000 (14:02 -0800)]
ReplicatedPG.cc: update coi version and prior_version to match log

Fixes an error where the oi on a clone would not get the updated version
when snaps was updated.  oi.version would lag behind the missing item's
need field during recovery.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years ago ReplicatedPG.cc: fix use of potentially invalid pointer
Samuel Just [Wed, 19 Jan 2011 20:06:17 +0000 (12:06 -0800)]
ReplicatedPG.cc: fix use of potentially invalid pointer

rollback_to may not be initialized if ret != 0.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years ago ReplicatedPG,PG,OSD: snap_trimmer should run only when the PG is clean
Samuel Just [Thu, 20 Jan 2011 00:47:32 +0000 (16:47 -0800)]
ReplicatedPG,PG,OSD: snap_trimmer should run only when the PG is clean

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years ago signals: handle_fatal_signal: use SA_NODEFER
Colin Patrick McCabe [Tue, 28 Dec 2010 02:04:17 +0000 (18:04 -0800)]
signals: handle_fatal_signal: use SA_NODEFER

SA_RESETHAND | SA_NODEFER allows the "re-trigger default signal handler"
trick to work for signals other than SIGSEGV.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
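The "re-trigger default signal handler" trick from the SA_NODEFER commit above can be sketched like this. It is a minimal POSIX illustration and not Ceph's handler: all function names are hypothetical, and where the real handler is described as printing a backtrace, this one only writes a fixed message. The helper child_termination_signal exists purely so the behavior can be observed from a parent process.

```cpp
#include <csignal>
#include <cstring>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical handler: log something, then re-raise the same signal.
extern "C" void sketch_fatal_handler(int signum) {
  const char msg[] = "*** caught fatal signal\n";
  // write() is async-signal-safe; its result is deliberately ignored.
  ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;
  // SA_RESETHAND has already restored the default disposition, and
  // SA_NODEFER left signum unblocked inside the handler, so this raise()
  // is delivered right away and the default action runs.
  raise(signum);
}

bool install_sketch_handler(int signum) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = sketch_fatal_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND | SA_NODEFER;
  return sigaction(signum, &sa, nullptr) == 0;
}

// Forks a child that installs the handler and raises signum. Returns the
// signal that terminated the child, or -1 if it was not killed by a signal.
int child_termination_signal(int signum) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    if (!install_sketch_handler(signum)) _exit(1);
    raise(signum);
    _exit(0);  // reached only if the signal did not terminate the process
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Without SA_NODEFER the signal would stay blocked while the handler runs, so the raise() inside the handler would only take effect after the handler returned.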
14 years ago signals: backtrace some more exotic fatal signals
Colin Patrick McCabe [Tue, 28 Dec 2010 01:49:57 +0000 (17:49 -0800)]
signals: backtrace some more exotic fatal signals

We're not likely to see these, but if we do, we want it in the logs!

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago signals: Handle SIGILL, SIGBUS, SIGFPE.
Colin Patrick McCabe [Tue, 28 Dec 2010 01:29:57 +0000 (17:29 -0800)]
signals: Handle SIGILL, SIGBUS, SIGFPE.

Print out a backtrace when we get SIGILL, SIGBUS, or SIGFPE. Fix a bug
where we failed to install a SIGABRT handler.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago mds: fix journaling of root default_file_layout
Sage Weil [Wed, 19 Jan 2011 17:50:41 +0000 (09:50 -0800)]
mds: fix journaling of root default_file_layout

We need to include the default_file_layout (if any) on root inodes, too.

Fixes: #725
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mon: remove rank from failed when taking over for failed node
Sage Weil [Tue, 18 Jan 2011 23:13:07 +0000 (15:13 -0800)]
mon: remove rank from failed when taking over for failed node

Leaving it there leaves a broken MDSMap, and prevents rejoin because
MDSMap::is_rejoining() is always false.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: kick discovers when peers enter active|clientreplay|rejoin
Sage Weil [Tue, 18 Jan 2011 23:09:51 +0000 (15:09 -0800)]
mds: kick discovers when peers enter active|clientreplay|rejoin

We process discovers when active, clientreplay, or later stages of rejoin.
Wait until then to resend pending discovers.  In particular, do NOT send
them when the node has just failed (from handle_mds_failure), or we will
crash.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mon: fix 'ceph mds fail <N>' command
Sage Weil [Tue, 18 Jan 2011 21:27:10 +0000 (13:27 -0800)]
mon: fix 'ceph mds fail <N>' command

We need to remove the mds_info from the map for cmds to take notice.

Fixes: #720
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago PG: fix adjust_local_snaps bug
Samuel Just [Tue, 18 Jan 2011 21:16:34 +0000 (13:16 -0800)]
PG: fix adjust_local_snaps bug

current must be removed from to_remove in the loop for the loop to
terminate (and not cause a double erasure from snap_collections)!

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
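The loop-termination point in the adjust_local_snaps commit above can be illustrated with a small sketch. This is not the Ceph function: remove_local_snaps and demo_removed_count are hypothetical names, and plain std::set stands in for the containers Ceph actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Removes every snap in to_remove from snap_collections and returns how
// many collections were actually erased.
std::size_t remove_local_snaps(std::set<uint64_t>& snap_collections,
                               std::set<uint64_t> to_remove) {
  std::size_t removed = 0;
  while (!to_remove.empty()) {
    uint64_t current = *to_remove.begin();
    removed += snap_collections.erase(current);
    // Dropping current from to_remove is what lets the loop finish.
    // Without it, the same snap is picked on every pass and erased from
    // snap_collections again and again.
    to_remove.erase(current);
  }
  return removed;
}

// Small driver: three local collections, two of them scheduled for removal.
std::size_t demo_removed_count() {
  std::set<uint64_t> snap_collections = {1, 2, 5};
  return remove_local_snaps(snap_collections, {2, 5, 9});
}
```

With std::set a repeated erase is harmless and just returns 0; a container that asserts on erasing a missing element would fail instead of spinning.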
14 years ago Merge branch 'purged_snaps' into testing
Sage Weil [Tue, 18 Jan 2011 15:37:58 +0000 (07:37 -0800)]
Merge branch 'purged_snaps' into testing

14 years ago osd: rebind heartbeat_messenger (with cluster one) when wrongly marked down
Sage Weil [Sat, 15 Jan 2011 00:58:47 +0000 (16:58 -0800)]
osd: rebind heartbeat_messenger (with cluster one) when wrongly marked down

This keeps things clean.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago messenger: let rebind() avoid multiple ports
Sage Weil [Sat, 15 Jan 2011 00:58:19 +0000 (16:58 -0800)]
messenger: let rebind() avoid multiple ports

We need to rebind two messengers, which means avoiding both old ports.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: drop messages from before we moved back to boot state
Sage Weil [Sat, 15 Jan 2011 00:57:38 +0000 (16:57 -0800)]
osd: drop messages from before we moved back to boot state

We want to make sure we ignore any messages sent to us before we moved
back to the boot state (after being wrongly marked down).  This is only
a problem currently while we are in the BOOT state and waiting to be
re-added to the map, because we may then call _share_map_incoming and
send something on the new rebound messenger to an old peer.  Also assert
that we are !booting there to be sure.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago MDS: Use new C_Gather::get_num_remaining() in MDCache.
Greg Farnum [Sat, 15 Jan 2011 00:22:11 +0000 (16:22 -0800)]
MDS: Use new C_Gather::get_num_remaining() in MDCache.

It was using get_num(), which now reports the number created.
This probably wouldn't have worked previously except that
~C_Gather::C_GatherSub was inappropriately calling rm_sub().

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
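The difference between get_num() and get_num_remaining() described in the commit above can be sketched with two counters. This is an illustration under assumptions, not the C_Gather implementation: GatherSketch and demo_gather_counts are hypothetical, the counter names are borrowed from the sub_[created|existing]_count wording in the neighbouring C_Gather commit, and a std::mutex is assumed for protection.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

class GatherSketch {
  std::mutex lock;
  int sub_created_count = 0;   // total subs ever created
  int sub_existing_count = 0;  // subs created but not yet finished
public:
  void new_sub() {
    std::lock_guard<std::mutex> guard(lock);
    ++sub_created_count;
    ++sub_existing_count;
  }
  void finish_sub() {
    std::lock_guard<std::mutex> guard(lock);
    assert(sub_existing_count > 0);
    --sub_existing_count;
  }
  // Number of subs created so far; finishing a sub does not change it.
  int get_num() {
    std::lock_guard<std::mutex> guard(lock);
    return sub_created_count;
  }
  // Number of subs still outstanding.
  int get_num_remaining() {
    std::lock_guard<std::mutex> guard(lock);
    return sub_existing_count;
  }
};

// Small driver: create three subs, finish two, report (created, remaining).
std::pair<int, int> demo_gather_counts() {
  GatherSketch gather;
  gather.new_sub();
  gather.new_sub();
  gather.new_sub();
  gather.finish_sub();
  gather.finish_sub();
  return {gather.get_num(), gather.get_num_remaining()};
}
```

A caller that wants to know whether work is still pending needs the remaining count; the created count keeps its value after every sub has finished.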
14 years ago C_Gather: Set debug #ifdefs to remove set.
Greg Farnum [Sat, 15 Jan 2011 00:12:32 +0000 (16:12 -0800)]
C_Gather: Set debug #ifdefs to remove set.

This way when we're confident it works right, we can
remove the set<Context*> and just rely on ref counting.

Further optimizations would include using a spinlock
rather than a mutex, or possibly even just switching
sub_[created|existing]_count to be atomics.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years ago C_Gather: Rewrite for thread safety.
Greg Farnum [Sat, 15 Jan 2011 00:11:01 +0000 (16:11 -0800)]
C_Gather: Rewrite for thread safety.

Previously, C_Gather wasn't thread safe at all,
and there was an issue with creating subs while some
subs were being finished.
These issues are now fixed.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years ago mds: call MonClient::shutdown when doing a journal dump.
Greg Farnum [Wed, 12 Jan 2011 22:46:30 +0000 (14:46 -0800)]
mds: call MonClient::shutdown when doing a journal dump.

Previously we got a failed assert since nothing was calling this.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years ago os: don't crash on no-journal case
Colin Patrick McCabe [Sun, 9 Jan 2011 21:34:40 +0000 (13:34 -0800)]
os: don't crash on no-journal case

JournalingObjectStore::commit_start should handle the case where journal is
null. This will occur if the user doesn't configure a journal.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago mds: tolerate (with warning) replayed op with bad prealloc_inos
Sage Weil [Fri, 14 Jan 2011 06:08:40 +0000 (22:08 -0800)]
mds: tolerate (with warning) replayed op with bad prealloc_inos

This comes up when an ESession close is followed by an EMetaBlob that
uses a prealloc_ino.  That isn't supposed to happen (it's probably a corner
case with session timeout vs a request waiting on locks that didn't
get killed/canceled?).  But tolerate it during replay just the same.

Works around #708.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: improve debug output on ESession journal replay
Sage Weil [Fri, 14 Jan 2011 05:51:05 +0000 (21:51 -0800)]
mds: improve debug output on ESession journal replay

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago OSD,ReplicatedPG: Do not run snap_trimmer while the pg is degraded
Samuel Just [Fri, 14 Jan 2011 00:18:40 +0000 (16:18 -0800)]
OSD,ReplicatedPG: Do not run snap_trimmer while the pg is degraded

snap_trimmer causes replica crashes if the replica is missing
objects.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago ReplicatedPG: snap_trimmer skip removed snaps without collections
Samuel Just [Thu, 13 Jan 2011 19:15:15 +0000 (11:15 -0800)]
ReplicatedPG: snap_trimmer skip removed snaps without collections

If no writes are made between two snapshots, the first won't get a snap
collection.  Subsequently removing that snap causes a crash in
snap_trimmer since the collection does not exist.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago OSD: _pg_process_info refactor to use adjust_local_snaps
Samuel Just [Thu, 13 Jan 2011 19:10:31 +0000 (11:10 -0800)]
OSD: _pg_process_info refactor to use adjust_local_snaps

Changes _pg_process_info to use adjust_local_snaps.  Also accounts for
the incoming info not being a superset of the existing info.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago PG: added adjust_local_snaps, activate now checks local collections
Samuel Just [Thu, 13 Jan 2011 19:02:58 +0000 (11:02 -0800)]
PG: added adjust_local_snaps, activate now checks local collections

adjust_local_snaps handles removing local collections contained in
to_check.  On activate, pg will now remove local collections contained
in purged_snaps.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago PG: change snap_collections to an interval_set
Samuel Just [Tue, 21 Dec 2010 20:01:41 +0000 (12:01 -0800)]
PG: change snap_collections to an interval_set

Previously, the set of local snap collections was represented using a
set, which complicates set operations with interval_sets.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
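To make the set versus interval_set distinction in the commit above concrete, here is a minimal interval-set sketch. It is not Ceph's interval_set: IntervalSetSketch and demo_interval_count are hypothetical, and this version is assumed to merge overlapping or adjacent inserts silently, which may differ from how the real class treats overlaps.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

class IntervalSetSketch {
  // start -> length; intervals are kept disjoint and non-adjacent.
  std::map<uint64_t, uint64_t> m;
public:
  void insert(uint64_t start, uint64_t len = 1) {
    if (len == 0) return;
    uint64_t end = start + len;  // half-open range [start, end)
    // Find the first stored interval that could overlap or touch the range.
    auto it = m.lower_bound(start);
    if (it != m.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second >= start) it = prev;
    }
    // Absorb every stored interval that overlaps or touches the range.
    while (it != m.end() && it->first <= end) {
      start = std::min(start, it->first);
      end = std::max(end, it->first + it->second);
      it = m.erase(it);
    }
    m[start] = end - start;
  }

  bool contains(uint64_t v) const {
    auto it = m.upper_bound(v);
    if (it == m.begin()) return false;
    --it;
    return v < it->first + it->second;
  }

  std::size_t num_intervals() const { return m.size(); }
};

// Small driver: snaps 1, 2, 3 collapse into one interval; 7 stays separate.
std::size_t demo_interval_count() {
  IntervalSetSketch snaps;
  snaps.insert(1);
  snaps.insert(2);
  snaps.insert(3);
  snaps.insert(7);
  return snaps.num_intervals();
}
```

Storing runs of consecutive snap ids as ranges means both operands of a union or difference share one representation, instead of converting between a plain set and ranges each time.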
14 years ago PG: activate should not enqueue snap_trimmer on a replica
Samuel Just [Thu, 13 Jan 2011 20:18:17 +0000 (12:18 -0800)]
PG: activate should not enqueue snap_trimmer on a replica

Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.

This should fix #702.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago filejournal: rewrite completion handling, fix ordering on full->notfull
Sage Weil [Thu, 13 Jan 2011 21:14:24 +0000 (13:14 -0800)]
filejournal: rewrite completion handling, fix ordering on full->notfull

Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion ordering invariant.

This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions.  That same logic needs to also apply to new items
submitted during that commit interval.  This was broken before, but the
simpler structure fixes it.  Fixes #666.

Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago ReplicatedPG: Fix oi.size bug in _rollback_to
Samuel Just [Wed, 12 Jan 2011 23:09:51 +0000 (15:09 -0800)]
ReplicatedPG: Fix oi.size bug in _rollback_to

_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0.  _rollback_to now resets
the size to match the rolled back object.  Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago ReplicatedPG: register_object_context and register_snapset_context cleanup
Samuel Just [Wed, 12 Jan 2011 21:51:55 +0000 (13:51 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup

Previously, get_object_context and get_snapset_context did not register
the resulting objects.  In some cases, these objects would not get
registered and multiple copies would end up created.  This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago ReplicatedPG: snap_trimmer work around
Samuel Just [Wed, 12 Jan 2011 20:07:44 +0000 (12:07 -0800)]
ReplicatedPG: snap_trimmer work around

Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps.  This work around should let kvmtest
come back up.  A real fix is still needed.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago osd: OSD::queue_pg_for_deletion: avoid double del
Colin Patrick McCabe [Tue, 11 Jan 2011 18:15:02 +0000 (10:15 -0800)]
osd: OSD::queue_pg_for_deletion: avoid double del

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago mds: avoid double-pinning stray inodes
Sage Weil [Tue, 11 Jan 2011 17:50:20 +0000 (09:50 -0800)]
mds: avoid double-pinning stray inodes

We make multiple iterations through populate_mydir().  Only pin each stray
once.  Fixes #689 and crashes like

mds/CInode.h: In function 'virtual void CInode::bad_get(int)':
mds/CInode.h:1088: FAILED assert(ref_set.count(by) == 0)
ceph version 0.24 (180a4176035521940390f4ce24ee3eb7aa290632)
1: (CInode::bad_put(int)+0) [0x827b090]
2: (MDSCacheObject::get(int)+0x153) [0x813e463]
3: (MDCache::populate_mydir()+0x8a) [0x81a7e5a]
4: (MDCache::_create_system_file_finish(Mutation*, CDentry*,
Context*)+0x181) [0x819f501]
5: (C_MDC_CreateSystemFile::finish(int)+0x29) [0x81d6c29]
6: (finish_contexts(std::list<Context*, std::allocator<Context*> >&,
int)+0x6b) [0x81d663b]
7: (Journaler::_finish_flush(int, long long, utime_t, bool)+0x983) [0x82f2f53]
8: (Journaler::C_Flush::finish(int)+0x3f) [0x82fb24f]
9: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x801) [0x82d8e31]
10: (MDS::_dispatch(Message*)+0x2ae5) [0x80eaa15]
11: (MDS::ms_dispatch(Message*)+0x62) [0x80eb142]
12: (SimpleMessenger::dispatch_entry()+0x899) [0x80b8649]
13: (SimpleMessenger::DispatchThread::entry()+0x22) [0x80b30f2]

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago ReplicatedPG: Fix bug in rollback
Samuel Just [Mon, 10 Jan 2011 22:45:06 +0000 (14:45 -0800)]
ReplicatedPG: Fix bug in rollback

Previously, _rollback_to assumed that the rollback was a noop if
ctx->clone_obc was set and its prior version matches head's version.
However, this broke in sequences like:

Write "snap1 contents" to oid "blah"
create snapshot "snap1"
Write "snap2 contents" to oid "blah"
create snapshot "snap2"
rollback oid "blah" to snapshot "snap1"

In this case, make_writeable would have just cloned head to the snap2
clone, but the relevant clone is actually "snap1".  _rollback_to now
verifies that the most recent clone is the correct one before assuming
that head is already correct.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago v0.24.1 v0.24.1
Sage Weil [Sat, 8 Jan 2011 00:50:15 +0000 (16:50 -0800)]
v0.24.1

14 years ago ReplicatedPG: get_object_context ssc refcount leak
Samuel Just [Fri, 7 Jan 2011 22:23:04 +0000 (14:23 -0800)]
ReplicatedPG: get_object_context ssc refcount leak

If obc->obs.ssc is non-null, the second get_snapset_context ends up
leaking a snapset reference.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago ReplicatedPG: clone_overlap should contain one entry per clone
Samuel Just [Thu, 6 Jan 2011 23:48:13 +0000 (15:48 -0800)]
ReplicatedPG: clone_overlap should contain one entry per clone

Previously, writefull and _delete_head would remove the last
entry from snapset.clone_overlap.  Now, the last entry becomes
an empty interval_set.  clone_overlap should contain one entry
per clone.

The missing entries previously caused a bug in _rollback_to where
iter would be clone_overlap.end().

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago PG: Fixes bug in _scrub with checking clones
Samuel Just [Tue, 4 Jan 2011 22:30:15 +0000 (14:30 -0800)]
PG: Fixes bug in _scrub with checking clones

I introduced this bug in
4a4a1e53c7d380cd0b582c1d0685fd0ef4ef1711.
curclone++ not curclone--.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago PG: Fix bug in scrub when checking clone sizes
Samuel Just [Tue, 4 Jan 2011 00:48:39 +0000 (16:48 -0800)]
PG: Fix bug in scrub when checking clone sizes

Previously, _scrub checked:
assert(p->second.size == snapset.clone_size[curclone])

curclone was, however, an index into snapset.clones rather than a
snapid_t.  For clarity, curclone is now an iterator.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
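The index-versus-key confusion described in the scrub commit above can be shown in a few lines. This is a sketch under assumptions, not the _scrub code: SnapSetSketch, sizes_match, and demo_sizes_match are hypothetical, snap ids are modelled as uint64_t instead of snapid_t, and the iteration direction is chosen arbitrarily.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct SnapSetSketch {
  std::vector<uint64_t> clones;             // snap ids of the clones
  std::map<uint64_t, uint64_t> clone_size;  // snap id -> size in bytes
};

// Checks the observed clone sizes against the snapset, looking each size up
// by snap id. Using the position in clones as the key would query the map
// for ids such as 0 or 1 that may not exist at all.
bool sizes_match(const SnapSetSketch& ss,
                 const std::vector<uint64_t>& observed) {
  if (observed.size() != ss.clones.size()) return false;
  auto obs = observed.begin();
  for (auto curclone = ss.clones.begin(); curclone != ss.clones.end();
       ++curclone, ++obs) {
    auto it = ss.clone_size.find(*curclone);  // key is the snap id
    if (it == ss.clone_size.end() || it->second != *obs) return false;
  }
  return true;
}

// Small driver: two clones with snap ids 4 and 9.
bool demo_sizes_match() {
  SnapSetSketch ss;
  ss.clones = {4, 9};
  ss.clone_size = {{4, 100}, {9, 250}};
  return sizes_match(ss, {100, 250});
}
```

Making curclone an iterator means the only value available for the lookup is the snap id it points at, which removes the chance of passing a position by mistake.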
14 years ago mds: assert no submit_entry during replay state
Sage Weil [Fri, 24 Dec 2010 17:00:28 +0000 (09:00 -0800)]
mds: assert no submit_entry during replay state

We should never submit items to the journal during replay.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: start new log segment resolve start, not replay finish
Sage Weil [Fri, 24 Dec 2010 17:00:02 +0000 (09:00 -0800)]
mds: start new log segment resolve start, not replay finish

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: clean up backlog generation checks a bit
Sage Weil [Fri, 24 Dec 2010 16:36:28 +0000 (08:36 -0800)]
osd: clean up backlog generation checks a bit

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: generate backlog if needed to get last_complete >= log.tail || backlog
Sage Weil [Fri, 24 Dec 2010 16:36:05 +0000 (08:36 -0800)]
osd: generate backlog if needed to get last_complete >= log.tail || backlog

If primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering.  This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: send sufficient log to compensate for replicas with last_complete < log.tail
Sage Weil [Fri, 24 Dec 2010 16:27:38 +0000 (08:27 -0800)]
osd: send sufficient log to compensate for replicas with last_complete < log.tail

If a replica has last_complete < log.tail and no backlog, send enough log
for them to get back into a consistent state.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: load root inode on replay if auth
Sage Weil [Mon, 3 Jan 2011 22:32:48 +0000 (14:32 -0800)]
mds: load root inode on replay if auth

If we are auth for the root inode, load its initial value off of disk. We
may not see it in the log if it has not been modified.  If it has, this
is useless but fast/harmless.  This only occurs for brand-new filesystems
where the mds is immediately restarted.

Fixes #671.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago msgr: Unlock dispatch_queue.lock when short-circuiting queue_received.
Greg Farnum [Mon, 3 Jan 2011 22:14:00 +0000 (14:14 -0800)]
msgr: Unlock dispatch_queue.lock when short-circuiting queue_received.

Previously we left the mutex locked, which is obviously bad bad bad!
I believe this was the cause of #673.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years ago filestore: assert on out of order journal pipeline submissions
Sage Weil [Mon, 3 Jan 2011 21:14:49 +0000 (13:14 -0800)]
filestore: assert on out of order journal pipeline submissions

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago filestore: fix wake condition when journal submission blocks
Sage Weil [Mon, 3 Jan 2011 21:14:13 +0000 (13:14 -0800)]
filestore: fix wake condition when journal submission blocks

We only want to wake up if we are at the front of the line, in order to
preserve journal submission pipeline ordering.

This fixes, among other things, messages in the log like

2010-12-21 10:38:42.515974 7f0861486700 journal op_submit_finish 5364 expected 5370, OUT OF ORDER

and bug #666.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: fix purge_stray for directories, zeroed layouts
Sage Weil [Mon, 3 Jan 2011 19:50:53 +0000 (11:50 -0800)]
mds: fix purge_stray for directories, zeroed layouts

- We don't want to purge file content on directories
- Don't fall over if a file has a zero period

Reported-by: Paul Komkoff <i@stingr.net>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: PG::Info::History: init last_epoch_clean
Colin Patrick McCabe [Wed, 29 Dec 2010 01:03:12 +0000 (17:03 -0800)]
osd: PG::Info::History: init last_epoch_clean

It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago SimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
Samuel Just [Wed, 1 Dec 2010 00:52:40 +0000 (16:52 -0800)]
SimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
when the pipe has been halted.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago v0.24 v0.24
Sage Weil [Wed, 15 Dec 2010 21:01:57 +0000 (13:01 -0800)]
v0.24

14 years ago osd: compensate for replicas with tail > last_complete
Sage Weil [Mon, 20 Dec 2010 21:22:49 +0000 (13:22 -0800)]
osd: compensate for replicas with tail > last_complete

Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590).  In that case, the primary
can compensate by sending more log info to the replica.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: make nested scatterlock state change check more robust
Sage Weil [Sat, 18 Dec 2010 05:02:58 +0000 (21:02 -0800)]
mds: make nested scatterlock state change check more robust

The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start).  (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)

We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() (which may
call scatter_start) case.

Fixes this crash:

mds/MDLog.h: In function 'void MDLog::start_entry(LogEvent*)':
mds/MDLog.h:191: FAILED assert(cur_event == __null)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (CInode::finish_scatter_update(ScatterLock*, CDir*, unsigned long, unsigned long)+0x804) [0x606e14]
 2: (CInode::start_scatter(ScatterLock*)+0xaa) [0x60dc1a]
 3: (Locker::scatter_mix(ScatterLock*, bool*)+0x1ca) [0x589a9a]
 4: (Locker::wrlock_start(SimpleLock*, MDRequest*, bool)+0x165) [0x597d65]
 5: (MDCache::predirty_journal_parents(Mutation*, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x153e) [0x55a70e]
 6: (Locker::scatter_writebehind(ScatterLock*)+0x42d) [0x58553d]
 7: (Locker::simple_lock(SimpleLock*, bool*)+0x7ab) [0x58beeb]
 8: (Locker::scatter_nudge(ScatterLock*, Context*, bool)+0x3ad) [0x58c49d]
 9: (Locker::scatter_tick()+0x28a) [0x58c98a]
 10: (MDS::tick()+0x4e4) [0x4b26a4]
 11: (SafeTimer::timer_thread()+0x22c) [0x6d164c]
 12: (SafeTimerThread::entry()+0xd) [0x6d34bd]
 13: (Thread::_entry_func(void*)+0xa) [0x4943da]
 14: /lib/libpthread.so.0 [0x7fc87810b73a]
 15: (clone()+0x6d) [0x7fc876dad69d]

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago filestore: make OpSequencer::flush() work for writeahead journaling items
Sage Weil [Fri, 17 Dec 2010 23:12:17 +0000 (15:12 -0800)]
filestore: make OpSequencer::flush() work for writeahead journaling items

It was only waiting for items in the op_queue to complete.  The goal is
to wait for anything we've called queue_transactions(&osr,...) on. If we
do writeahead journaling, though, there might be new ops that are still
journaling but not yet submitted to the fs that are missed.

This adds a journal queue to the OpSequencer, and uses it in the writeahead
case only.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mon: build_initial_monmap: fix mismatched alloc
Colin Patrick McCabe [Fri, 17 Dec 2010 23:31:41 +0000 (15:31 -0800)]
mon: build_initial_monmap: fix mismatched alloc

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago common: cleanups
Colin Patrick McCabe [Fri, 17 Dec 2010 23:06:40 +0000 (15:06 -0800)]
common: cleanups

common_init: avoid (mismatched) heap allocation

ConfFile::_parse: avoid memory leak on error path

ConfFile: NULL filename if not set, rather than leaving it undefined

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago osd: PG::choose_acting: fix major iterator mistake
Colin Patrick McCabe [Fri, 17 Dec 2010 23:05:56 +0000 (15:05 -0800)]
osd: PG::choose_acting: fix major iterator mistake

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago rgw: fix fd leak on error path
Colin Patrick McCabe [Fri, 17 Dec 2010 23:05:36 +0000 (15:05 -0800)]
rgw: fix fd leak on error path

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago hadoop: fix a bunch of mismatched allocations
Colin Patrick McCabe [Fri, 17 Dec 2010 23:04:17 +0000 (15:04 -0800)]
hadoop: fix a bunch of mismatched allocations

Using array new means you need array delete.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
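The rule stated in the hadoop commit above, that array new needs array delete, looks like this in a small sketch. The function copy_via_array_new is hypothetical and is not taken from the hadoop bindings.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copies a C string through a temporary heap buffer.
std::string copy_via_array_new(const char* src) {
  std::size_t n = std::strlen(src) + 1;  // include the terminating NUL
  char* buf = new char[n];               // allocated with array new
  std::memcpy(buf, src, n);
  std::string out(buf);
  delete[] buf;  // must be delete[]; plain `delete buf` is undefined behavior
  return out;
}
```

In new code a std::string or std::vector<char> avoids the pairing question entirely, because the container releases its own storage.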
14 years ago auth: avoid mismatched allocation
Colin Patrick McCabe [Fri, 17 Dec 2010 23:03:37 +0000 (15:03 -0800)]
auth: avoid mismatched allocation

Can't pair strdup and free.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago osd: flush pg writes to disk before starting scrub scan
Sage Weil [Fri, 17 Dec 2010 20:54:38 +0000 (12:54 -0800)]
osd: flush pg writes to disk before starting scrub scan

This avoids two races:
 - we just completed recovery by pushing objects to the replica, and the
   replica starts scanning before those writes reach the fs.
 - we just trimmed to something after last_update_applied.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago filestore: add per-sequencer flush operation
Sage Weil [Fri, 17 Dec 2010 20:51:19 +0000 (12:51 -0800)]
filestore: add per-sequencer flush operation

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: debug scan_list and scrub a bit better
Sage Weil [Fri, 17 Dec 2010 20:51:03 +0000 (12:51 -0800)]
osd: debug scan_list and scrub a bit better

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: clear INCONSISTENT if scrub detects no errors
Sage Weil [Fri, 17 Dec 2010 18:59:45 +0000 (10:59 -0800)]
osd: clear INCONSISTENT if scrub detects no errors

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago osd: add assert that we're replica
Sage Weil [Fri, 17 Dec 2010 18:36:34 +0000 (10:36 -0800)]
osd: add assert that we're replica

ar Fred saw a crash where we got into merge_log as a stray, which really
shouldn't ever happen!  See #590.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago debian: don't strip rados classes
Laszlo Boszormenyi [Fri, 17 Dec 2010 16:31:00 +0000 (08:31 -0800)]
debian: don't strip rados classes

Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago debian: rename ceph.lintian -> ceph.lintian-overrides
Laszlo Boszormenyi [Fri, 17 Dec 2010 16:30:03 +0000 (08:30 -0800)]
debian: rename ceph.lintian -> ceph.lintian-overrides

Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago PG.cc:
Samuel Just [Thu, 16 Dec 2010 21:06:43 +0000 (13:06 -0800)]
PG.cc:
sub_op_scrub must set finalizing_scrub on the replica
before waiting for last_update_applied to catch up to
info.last_update.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago ReplicatedPG.cc:
Samuel Just [Wed, 15 Dec 2010 20:57:10 +0000 (12:57 -0800)]
ReplicatedPG.cc:
_scrub must set head when it encounters a head snap
curclone counts down, not up

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years ago filestore: detect final version of async ioctl SNAP_CREATE_V2
Sage Weil [Sat, 11 Dec 2010 00:26:06 +0000 (16:26 -0800)]
filestore: detect final version of async ioctl SNAP_CREATE_V2

Li's revised interface for the async snap ioctl is more flexible.  Update
the ioctl call sites and detection code accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago mds: Save straydn in mdr so it's consistent across retry attempts.
Greg Farnum [Wed, 15 Dec 2010 18:56:18 +0000 (10:56 -0800)]
mds: Save straydn in mdr so it's consistent across retry attempts.

Otherwise, we could choose new stray dirs and fail to get all
the locks we needed (while leaving old strays locked forever!).

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years ago mon: trim pgmap less aggressively
Sage Weil [Tue, 14 Dec 2010 17:26:12 +0000 (09:26 -0800)]
mon: trim pgmap less aggressively

This will make observer crashes due to missed states (#648) much harder to
hit.  Eventually the pgmap state trim problem will go away when the
monitor/paxos code is restructured (#647).

Signed-off-by: Sage Weil <sage@newdream.net>
14 years ago crypto: catch cryptopp decrypt/encrypt exceptions
Yehuda Sadeh [Tue, 14 Dec 2010 18:51:08 +0000 (10:51 -0800)]
crypto: catch cryptopp decrypt/encrypt exceptions

14 years ago osd: PG::prior_set_affected: const cleanup
Colin Patrick McCabe [Tue, 14 Dec 2010 09:53:37 +0000 (01:53 -0800)]
osd: PG::prior_set_affected: const cleanup

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago mds: fix replay/resent vs completed request check
Sage Weil [Sun, 12 Dec 2010 22:39:48 +0000 (14:39 -0800)]
mds: fix replay/resent vs completed request check

If it is a _replayed_ request, we should always send a simple ack if it is
completed, because the client does not care about any additional caps.

If it is a _resent_ request, then we want to return useful caps on open or
create requests, even if any modification side-effects have already been
committed.  The additional checks for completed already exist in the
create and open handlers.
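
The distinction above can be sketched as a small decision function. This is illustrative only: `Action` and `on_completed_request` are hypothetical names, and acking resent requests that are neither open nor create is an assumption the message does not spell out.

```cpp
// Hypothetical outcome for a request the MDS has already completed.
enum class Action {
    SimpleAck,   // reply with a plain ack, no extra caps
    RunHandler   // fall through to the open/create handler, which
                 // performs its own "already completed" checks
};

// Sketch of the check described above for an already-completed request.
Action on_completed_request(bool is_replay, bool is_open_or_create) {
    if (is_replay) {
        return Action::SimpleAck;    // replayed: client needs no additional caps
    }
    if (is_open_or_create) {
        return Action::RunHandler;   // resent open/create: return useful caps
    }
    return Action::SimpleAck;        // assumption: other resent requests just get acked
}
```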

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agorpm: update changelog
Colin Patrick McCabe [Thu, 9 Dec 2010 22:38:08 +0000 (14:38 -0800)]
rpm: update changelog

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: fix ceph.spec to work with gcephtool
Colin Patrick McCabe [Thu, 9 Dec 2010 19:46:28 +0000 (11:46 -0800)]
rpm: fix ceph.spec to work with gcephtool

Don't try to package gui_resources unless we are building the GUI.
Get GUI dependencies correct.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoFix overflow in FileJournal::_open_file()
Vangelis Koukis [Thu, 9 Dec 2010 18:53:22 +0000 (20:53 +0200)]
Fix overflow in FileJournal::_open_file()

Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.

Relevant messages from the osd logfile:

2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal

The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().
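
The logged size is what a 32-bit overflow followed by sign extension would produce. The sketch below is not the FileJournal code; it assumes a journal size of 3000 MB scaled to bytes in 32-bit arithmetic, and the helper names are hypothetical. Under that assumption it reproduces the value in the log.

```cpp
#include <cstdint>

// Hypothetical: scale megabytes to bytes using only 32 bits, then widen.
std::uint64_t journal_bytes_32bit(std::uint32_t size_mb) {
    std::uint32_t low_bits = size_mb << 20;                    // keeps only the low 32 bits
    // Reinterpret as signed (wraps on mainstream compilers; guaranteed from C++20).
    std::int32_t as_int = static_cast<std::int32_t>(low_bits);
    // Widening a negative int sign-extends, giving a huge unsigned value.
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_int));
}

// Widen first, then shift: no overflow for any 32-bit megabyte count.
std::uint64_t journal_bytes_64bit(std::uint32_t size_mb) {
    return static_cast<std::uint64_t>(size_mb) << 20;
}

// journal_bytes_32bit(3000) == 18446744072560312320 (the value in the log)
// journal_bytes_64bit(3000) == 3145728000
```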

Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Samuel Just [Thu, 9 Dec 2010 18:25:39 +0000 (10:25 -0800)]
ReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Cond is left in the mode.waiting_cond list.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: snap_trimmer now acquires a read lock on the osd map
Samuel Just [Thu, 9 Dec 2010 18:24:34 +0000 (10:24 -0800)]
ReplicatedPG: snap_trimmer now acquires a read lock on the osd map
before calling share_pg_info.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agorpm: don't try to package radosacl
Colin Patrick McCabe [Thu, 9 Dec 2010 18:59:57 +0000 (10:59 -0800)]
rpm: don't try to package radosacl

radosacl is just a test binary, so unless we build with --with-debug, we
won't get it.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: add pkgconfig to BuildRequires
Colin Patrick McCabe [Thu, 9 Dec 2010 18:39:34 +0000 (10:39 -0800)]
rpm: add pkgconfig to BuildRequires

You can't build without pkgconfig.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: set files-attr for radosgw
Colin Patrick McCabe [Thu, 9 Dec 2010 18:26:55 +0000 (10:26 -0800)]
rpm: set files-attr for radosgw

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agofilejournal: reset last_committed_seq if we find journal to be invalid
Sage Weil [Wed, 8 Dec 2010 23:53:13 +0000 (15:53 -0800)]
filejournal: reset last_committed_seq if we find journal to be invalid

If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal.  If that happens we also need to reset
last_committed_seq to avoid a crash like

2010-12-08 17:04:39.246950 7f269d138910 journal commit_finish thru 16904
2010-12-08 17:04:39.246961 7f269d138910 journal committed_thru 16904 < last_committed_seq 37778589
os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)':
os/FileJournal.cc:854: FAILED assert(seq >= last_committed_seq)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (FileJournal::committed_thru(unsigned long)+0xad) [0x588e7d]
 2: (JournalingObjectStore::commit_finish()+0x8c) [0x57f2ec]
 3: (FileStore::sync_entry()+0xcff) [0x5764cf]
 4: (FileStore::SyncThread::entry()+0xd) [0x506d9d]
 5: (Thread::_entry_func(void*)+0xa) [0x4790ba]
 6: /lib/libpthread.so.0 [0x7f26a2f8373a]
 7: (clone()+0x6d) [0x7f26a1c2569d]

Fixes #631
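
The failure and the fix can be sketched with a toy journal. This is a simplified model, not os/FileJournal.cc: `JournalSketch` and `discard` are hypothetical, resetting the sequence to 0 is an assumption, and `committed_thru` returns false where the real code asserts.

```cpp
#include <cstdint>

// Toy model of the two fields involved in the crash above.
struct JournalSketch {
    std::int64_t read_pos = 0;
    std::uint64_t last_committed_seq = 0;

    // Called when replay finds an entry later than expected.
    // `reset_seq` toggles the fix so both behaviours can be compared.
    void discard(bool reset_seq) {
        read_pos = -1;                  // journal contents are ignored from here on
        if (reset_seq) {
            last_committed_seq = 0;     // assumption: reset to the initial value
        }
    }

    // Mirrors the check behind assert(seq >= last_committed_seq).
    bool committed_thru(std::uint64_t seq) {
        if (seq < last_committed_seq) {
            return false;               // the real code would assert here
        }
        last_committed_seq = seq;
        return true;
    }
};
```

With the values from the log, a stale `last_committed_seq` of 37778589 rejects a commit through 16904, while a reset one accepts it.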

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: use helper for clock drift check; log relative instead of absolute time
Sage Weil [Wed, 8 Dec 2010 19:12:51 +0000 (11:12 -0800)]
mon: use helper for clock drift check; log relative instead of absolute time

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: sync->mix replica state is sync->mix(2)
Sage Weil [Tue, 7 Dec 2010 21:31:01 +0000 (13:31 -0800)]
mds: sync->mix replica state is sync->mix(2)

When auth first moves to sync->mix,
 - auth sends AC_MIX to replicas
 - replicas go to sync->mix
 - replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
 - auth gets all acks, sends AC_MIX again
 - replica moves to MIX

So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.
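
The replica side of the exchange above can be sketched as a small state machine. The enum and method names are hypothetical and the sketch ignores everything except the sync->mix path; it is not the MDS locker code.

```cpp
// Replica-side lock states on the sync->mix path.
enum class ReplicaState { Sync, SyncMix, SyncMix2, Mix };

struct ReplicaLockSketch {
    ReplicaState state;

    // Handle AC_MIX from the auth. Returns false if the message is
    // unexpected in the current state (the "confused" case).
    bool on_ac_mix() {
        switch (state) {
        case ReplicaState::Sync:     state = ReplicaState::SyncMix; return true;  // first AC_MIX
        case ReplicaState::SyncMix2: state = ReplicaState::Mix;     return true;  // second AC_MIX
        default:                     return false;
        }
    }

    // Gather finished: the replica would send AC_SYNCACK here and
    // move to sync->mix(2). Returns false if no gather was in progress.
    bool finish_gather() {
        if (state != ReplicaState::SyncMix) {
            return false;
        }
        state = ReplicaState::SyncMix2;
        return true;
    }
};
```

A replica created while the auth is already in sync->mix starts in `SyncMix2`, so the one AC_MIX it will still receive takes it straight to `Mix`. Starting it in `SyncMix` would leave that AC_MIX unexpected.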

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: do not choose lock state on replicas
Sage Weil [Tue, 7 Dec 2010 20:50:15 +0000 (12:50 -0800)]
mds: do not choose lock state on replicas

The lock state has already been set during rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: small rejoin cleanup
Sage Weil [Tue, 7 Dec 2010 20:45:04 +0000 (12:45 -0800)]
mds: small rejoin cleanup

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: rev mds cluster internal protocol
Sage Weil [Tue, 7 Dec 2010 19:26:24 +0000 (11:26 -0800)]
mds: rev mds cluster internal protocol

The lock encoding changed with the dirty bit on scatterlocks.

Signed-off-by: Sage Weil <sage@newdream.net>