git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
14 years agomdcache: change replay trimming a bit.
Greg Farnum [Mon, 20 Dec 2010 21:32:43 +0000 (13:32 -0800)]
mdcache: change replay trimming a bit.

Previously we were re-inserting dentries on the open list. But if
there weren't any other available dentries to trim, this could
have led to an infinite loop!
Now, we save them in a list and pop them back in once the trim
is done.
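
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the MDCache code: `Entry`, `trim_lru`, and the `open` flag are hypothetical names, and the real cache works on dentries in an LRU rather than a `std::list`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for a cache entry; not Ceph's CDentry.
struct Entry {
  int id;
  bool open;  // entries on the "open list" must survive the trim
};

// Trim the LRU toward `target` entries, never expiring open entries.
// Open entries are parked in `kept` instead of being pushed straight
// back onto the LRU, so the loop cannot pop the same entry twice.
inline std::vector<int> trim_lru(std::list<Entry>& lru, std::size_t target) {
  std::vector<int> trimmed;
  std::list<Entry> kept;
  while (lru.size() + kept.size() > target && !lru.empty()) {
    Entry e = lru.front();
    lru.pop_front();
    if (e.open)
      kept.push_back(e);        // park it until the trim is done
    else
      trimmed.push_back(e.id);  // actually expire it
  }
  // Pop the parked entries back in, preserving their relative order.
  lru.splice(lru.begin(), kept);
  assert(kept.empty());
  return trimmed;
}
```

Because every iteration removes one element from `lru`, the loop terminates even when only open entries remain, which is the case that could spin forever if they were re-inserted immediately.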

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: rename replay Contexts -- they were ambiguous at best.
Greg Farnum [Mon, 20 Dec 2010 21:10:44 +0000 (13:10 -0800)]
MDS: rename replay Contexts -- they were ambiguous at best.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: add gids to the logger file names.
Greg Farnum [Fri, 17 Dec 2010 23:56:44 +0000 (15:56 -0800)]
MDS: add gids to the logger file names.

This is just to make differentiating between the standby's files
and stuff easier.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomdlog: return EAGAIN if replay falls off the tail of the journal.
Greg Farnum [Fri, 17 Dec 2010 21:25:04 +0000 (13:25 -0800)]
mdlog: return EAGAIN if replay falls off the tail of the journal.

This can happen when we're following an active journal, and
would previously cause the MDS to shut down. Now we return EAGAIN,
so the MDS can recover as it likes.
Currently, that recovery is a simple respawn, as when we discover
we've fallen behind via probing.
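
A rough sketch of the error path, assuming that "falling off the tail" means the reader's position is older than the journal's expire position. `JournalView`, `check_replay_pos`, and `needs_respawn` are hypothetical names for illustration only; the real check lives in the MDLog replay code.

```cpp
#include <cerrno>
#include <cstdint>

// Hypothetical view of the journal bounds; field names are illustrative.
struct JournalView {
  uint64_t expire_pos;  // oldest byte still present in the journal
  uint64_t write_pos;   // end of the journal as last probed
};

// Returns 0 if read_pos is still inside the live journal, or -EAGAIN if
// the active MDS has already trimmed past it.
inline int check_replay_pos(const JournalView& j, uint64_t read_pos) {
  if (read_pos < j.expire_pos)
    return -EAGAIN;  // recoverable: caller decides how to catch up
  return 0;
}

// The caller's policy; per the commit message this is currently a respawn.
inline bool needs_respawn(int r) { return r == -EAGAIN; }
```

Returning an error code instead of asserting leaves the recovery policy to the MDS rather than baking a shutdown into the log reader.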

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agojournaler: Add init_headers function, call when reading head off disk.
Greg Farnum [Fri, 17 Dec 2010 00:47:30 +0000 (16:47 -0800)]
journaler: Add init_headers function, call when reading head off disk.

Uninitialized headers were causing a failed assert during replay,
and there's no good reason to leave them set at their defaults just
because the *current* incarnation of this MDS has never written to
disk!

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomds: After probing the journal, reset if we've fallen behind.
Greg Farnum [Thu, 16 Dec 2010 19:53:38 +0000 (11:53 -0800)]
mds: After probing the journal, reset if we've fallen behind.

Previously, if the journal got trimmed and we missed log entries,
we failed out in the journaling step and stopped.
This is still possible and needs to be fixed, but pre-emptively checking
that we're still in the live part of the journal narrows the race range.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: make standby_trim_segments functional. Hurray, hot standbys work!
Greg Farnum [Wed, 15 Dec 2010 00:45:50 +0000 (16:45 -0800)]
MDS: make standby_trim_segments functional. Hurray, hot standbys work!

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomdlog: Add some helper functions for accessing segments map data.
Greg Farnum [Wed, 15 Dec 2010 00:45:21 +0000 (16:45 -0800)]
mdlog: Add some helper functions for accessing segments map data.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomdcache: adjust trim() to handle running during standby-replay.
Greg Farnum [Wed, 15 Dec 2010 00:44:55 +0000 (16:44 -0800)]
mdcache: adjust trim() to handle running during standby-replay.

This just means it needs to handle files on the open list and not
trim them. Add a check for that with an assert, and keep them alive.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoelist: add a clear_list function.
Greg Farnum [Wed, 15 Dec 2010 00:43:45 +0000 (16:43 -0800)]
elist: add a clear_list function.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agolru: change control flow and an assert to keep purpose clearer.
Greg Farnum [Tue, 14 Dec 2010 18:37:13 +0000 (10:37 -0800)]
lru: change control flow and an assert to keep purpose clearer.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDSMonitor: Remove STATE_ONESHOT_REPLAY from takeover logic in tick().
Greg Farnum [Thu, 9 Dec 2010 00:30:32 +0000 (16:30 -0800)]
MDSMonitor: Remove STATE_ONESHOT_REPLAY from takeover logic in tick().

If something dies during a journal-check we shouldn't have anybody
doing standby for them, so assert out!

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDSMonitor: Do not set the rank of an MDS in standby-replay
Greg Farnum [Wed, 8 Dec 2010 17:42:57 +0000 (09:42 -0800)]
MDSMonitor: Do not set the rank of an MDS in standby-replay
or oneshot-replay modes.

This was causing issues with identification in various circumstances,
and turns out to be unnecessary. The MDS now will set its whoami
variable from the standby_for_rank field if that's appropriate.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: MDSMonitor: if MDS is in standby-replay and its leader goes down,
Greg Farnum [Wed, 8 Dec 2010 17:39:59 +0000 (09:39 -0800)]
MDS: MDSMonitor: if MDS is in standby-replay and its leader goes down,
take over as the MDS!

This means we can now exit standby-replay.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDLog: don't change expire_pos or read_pos on replay.
Greg Farnum [Tue, 7 Dec 2010 20:46:10 +0000 (12:46 -0800)]
MDLog: don't change expire_pos or read_pos on replay.

These are unnecessary or rendered irrelevant by previous commit
removing read_pos from the on-disk Header.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: Remove the unused read_pos field.
Greg Farnum [Tue, 7 Dec 2010 19:48:08 +0000 (11:48 -0800)]
Journaler: Remove the unused read_pos field.

Rename it to unused_field, fill the in-memory read_pos
from header.expire_pos, and fill unused_field with the expire_pos
for safety.
(The on-disk header pos was used to fill in read_pos, but it was
always reset to expire_pos before being used and was only ever
set at the end of replay.)
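
As a sketch of what this change amounts to (the struct layout and helper names here are assumptions for illustration, not the actual Journaler::Header encoding):

```cpp
#include <cstdint>

// Hypothetical header layout. The slot that used to hold read_pos stays
// in place under a new name, which keeps the field count unchanged.
struct Header {
  uint64_t trimmed_pos;
  uint64_t expire_pos;
  uint64_t unused_field;  // formerly read_pos
  uint64_t write_pos;
};

// When writing a header, mirror expire_pos into the unused slot "for safety".
inline Header make_header(uint64_t trimmed, uint64_t expire, uint64_t write) {
  Header h;
  h.trimmed_pos = trimmed;
  h.expire_pos = expire;
  h.unused_field = expire;
  h.write_pos = write;
  return h;
}

// When loading, the in-memory read position starts at expire_pos.
inline uint64_t initial_read_pos(const Header& h) { return h.expire_pos; }
```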

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: miscellaneous standby-replay fixes and cleanups.
Greg Farnum [Fri, 3 Dec 2010 00:38:00 +0000 (16:38 -0800)]
MDS: miscellaneous standby-replay fixes and cleanups.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: make use of the hooks to start standby-replay.
Greg Farnum [Fri, 3 Dec 2010 00:36:22 +0000 (16:36 -0800)]
MDS: make use of the hooks to start standby-replay.

This doesn't include trim, and there's no way to exit the replay!

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoMDS: Implement the hooks for standby_replay.
Greg Farnum [Wed, 1 Dec 2010 21:28:44 +0000 (13:28 -0800)]
MDS: Implement the hooks for standby_replay.

This commit adds the necessary state checks and machinery
for the MDS to go through a "looping" replay.
It does not yet implement online trimming, nor is there any
way to get the MDS into or out of a standby_replay state.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agojournaler: add reread_head_and_probe function.
Greg Farnum [Wed, 1 Dec 2010 18:05:16 +0000 (10:05 -0800)]
journaler: add reread_head_and_probe function.

It does both so callers don't need to implement
intermediate bottom-half handlers.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomds: add expire_pos to the ESubtreeMap.
Greg Farnum [Tue, 30 Nov 2010 22:00:32 +0000 (14:00 -0800)]
mds: add expire_pos to the ESubtreeMap.

This will allow more efficient trimming during standby_replay.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomds: extend the use of uint64_t instead of (signed) loff_t, et al.
Greg Farnum [Wed, 24 Nov 2010 21:44:37 +0000 (13:44 -0800)]
mds: extend the use of uint64_t instead of (signed) loff_t, et al.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomds: rename is_standby_replay() to is_oneshot_replay.
Greg Farnum [Wed, 24 Nov 2010 21:28:49 +0000 (13:28 -0800)]
mds: rename is_standby_replay() to is_oneshot_replay.
This better represents its current purpose.

14 years agomds: Create new STATE_ONESHOT_REPLAY for the MDS.
Greg Farnum [Wed, 24 Nov 2010 00:20:05 +0000 (16:20 -0800)]
mds: Create new STATE_ONESHOT_REPLAY for the MDS.

This takes over the previous behavior of STATE_STANDBY_REPLAY,
allowing standby-replay to be used for the upcoming continuous-replay
that will enable hot standbys.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: make reprobe() an asynchronous function.
Greg Farnum [Tue, 23 Nov 2010 00:19:06 +0000 (16:19 -0800)]
Journaler: make reprobe() an asynchronous function.

This better fits the spirit of the other functions, and the MDS itself.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: make reread_head an asynchronous function.
Greg Farnum [Mon, 22 Nov 2010 20:39:34 +0000 (12:39 -0800)]
Journaler: make reread_head an asynchronous function.

This better fits the spirit of the other functions, and the MDS itself.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: redefine states to make them all unique.
Greg Farnum [Mon, 22 Nov 2010 18:54:54 +0000 (10:54 -0800)]
Journaler: redefine states to make them all unique.

Apparently PROBING and ACTIVE being identical was a mistake.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: Set the privacy of new functions correctly.
Greg Farnum [Fri, 19 Nov 2010 18:48:38 +0000 (10:48 -0800)]
Journaler: Set the privacy of new functions correctly.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: use uint64_t instead of int64_t.
Greg Farnum [Fri, 19 Nov 2010 18:36:40 +0000 (10:36 -0800)]
Journaler: use uint64_t instead of int64_t.

Since the values can never be negative, this is far more appropriate,
and it results in fewer casts than the other way around.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: Add function reprobe, to search for the new end of log.
Greg Farnum [Fri, 19 Nov 2010 18:13:47 +0000 (10:13 -0800)]
Journaler: Add function reprobe, to search for the new end of log.

Add new REPROBING state and split up new function probe() from _finish_read_head.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: Add reset() function, which returns it to the immediate post-ctor state
Greg Farnum [Fri, 19 Nov 2010 02:19:30 +0000 (18:19 -0800)]
Journaler: Add reset() function, which returns it to the immediate post-ctor state

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: Add a read-only setting, and asserts to make it fail on writes if readonly.
Greg Farnum [Thu, 18 Nov 2010 22:51:38 +0000 (14:51 -0800)]
Journaler: Add a read-only setting, and asserts to make it fail on writes if readonly.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: add new reread_head function and state.
Greg Farnum [Thu, 18 Nov 2010 19:56:04 +0000 (11:56 -0800)]
Journaler: add new reread_head function and state.

This is to facilitate the forthcoming up_shadow MDS state.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: remove unused vector<snapid_t> snaps from recover().
Greg Farnum [Thu, 18 Nov 2010 00:36:46 +0000 (16:36 -0800)]
Journaler: remove unused vector<snapid_t> snaps from recover().

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoJournaler: set state to STATE_ACTIVE in _finish_probe_end.
Greg Farnum [Thu, 18 Nov 2010 00:33:41 +0000 (16:33 -0800)]
Journaler: set state to STATE_ACTIVE in _finish_probe_end.

This was never actually getting set, although it doesn't matter
since STATE_ACTIVE and STATE_PROBING are defined to be the same.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoPG: Fixes bug in _scrub with checking clones
Samuel Just [Tue, 4 Jan 2011 22:30:15 +0000 (14:30 -0800)]
PG: Fixes bug in _scrub with checking clones

I introduced this bug in
4a4a1e53c7d380cd0b582c1d0685fd0ef4ef1711.
curclone++ not curclone--.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoPG: Fix bug in scrub when checking clone sizes
Samuel Just [Tue, 4 Jan 2011 00:48:39 +0000 (16:48 -0800)]
PG: Fix bug in scrub when checking clone sizes

Previously, _scrub checked:
assert(p->second.size == snapset.clone_size[curclone])

curclone was, however, an index into snapset.clones rather than a
snapid_t.  For clarity, curclone is now an iterator.
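
To illustrate the difference between indexing by position and looking up by snapid, here is a simplified sketch. `SnapSet` is reduced to two members and `snapid_t` is a plain integer alias; both are assumptions made for the example, not the real definitions.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;  // simplified; the real type is a wrapper

struct SnapSet {
  std::vector<snapid_t> clones;                 // clone snapids, in order
  std::map<snapid_t, uint64_t> clone_size;      // keyed by snapid, not by index
};

// Look up the expected size through an iterator into `clones`, so the
// map key is always an actual snapid.
inline uint64_t expected_size(const SnapSet& ss,
                              std::vector<snapid_t>::const_iterator curclone) {
  return ss.clone_size.at(*curclone);
}
```

With `clones = {4, 9}`, position 1 refers to snapid 9; using the position itself as the key would look up snapid 1, which does not exist in `clone_size`.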

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agomds: assert no submit_entry during replay state
Sage Weil [Fri, 24 Dec 2010 17:00:28 +0000 (09:00 -0800)]
mds: assert no submit_entry during replay state

We should never submit items to the journal during replay.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: start new log segment at resolve start, not replay finish
Sage Weil [Fri, 24 Dec 2010 17:00:02 +0000 (09:00 -0800)]
mds: start new log segment at resolve start, not replay finish

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: clean up backlog generation checks a bit
Sage Weil [Fri, 24 Dec 2010 16:36:28 +0000 (08:36 -0800)]
osd: clean up backlog generation checks a bit

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: generate backlog if needed to get last_complete >= log.tail || backlog
Sage Weil [Fri, 24 Dec 2010 16:36:05 +0000 (08:36 -0800)]
osd: generate backlog if needed to get last_complete >= log.tail || backlog

If primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering.  This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: send sufficient log to compensate for replicas with last_complete < log.tail
Sage Weil [Fri, 24 Dec 2010 16:27:38 +0000 (08:27 -0800)]
osd: send sufficient log to compensate for replicas with last_complete < log.tail

If a replica has last_complete < log.tail and no backlog, send enough log
for them to get back into a consistent state.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: load root inode on replay if auth
Sage Weil [Mon, 3 Jan 2011 22:32:48 +0000 (14:32 -0800)]
mds: load root inode on replay if auth

If we are auth for the root inode, load its initial value off of disk. We
may not see it in the log if it has not been modified.  If it has, this
is useless but fast/harmless.  This only occurs for brand-new filesystems
where the mds is immediately restarted.

Fixes #671.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomsgr: Unlock dispatch_queue.lock when short-circuiting queue_received.
Greg Farnum [Mon, 3 Jan 2011 22:14:00 +0000 (14:14 -0800)]
msgr: Unlock dispatch_queue.lock when short-circuiting queue_received.

Previously we left the mutex locked, which is obviously bad bad bad!
I believe this was the cause of #673.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agofilestore: assert on out of order journal pipeline submissions
Sage Weil [Mon, 3 Jan 2011 21:14:49 +0000 (13:14 -0800)]
filestore: assert on out of order journal pipeline submissions

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: fix wake condition when journal submission blocks
Sage Weil [Mon, 3 Jan 2011 21:14:13 +0000 (13:14 -0800)]
filestore: fix wake condition when journal submission blocks

We only want to wake up if we are at the front of the line, in order to
preserve journal submission pipeline ordering.

This fixes, among other things, messages in the log like

2010-12-21 10:38:42.515974 7f0861486700 journal op_submit_finish 5364 expected 5370, OUT OF ORDER

and bug #666.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: fix purge_stray for directories, zeroed layouts
Sage Weil [Mon, 3 Jan 2011 19:50:53 +0000 (11:50 -0800)]
mds: fix purge_stray for directories, zeroed layouts

- We don't want to purge file content on directories
- Don't fall over if a file has a zero period

Reported-by: Paul Komkoff <i@stingr.net>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: PG::Info::History: init last_epoch_clean
Colin Patrick McCabe [Wed, 29 Dec 2010 01:03:12 +0000 (17:03 -0800)]
osd: PG::Info::History: init last_epoch_clean

It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoSimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
Samuel Just [Wed, 1 Dec 2010 00:52:40 +0000 (16:52 -0800)]
SimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
when the pipe has been halted.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agov0.24 v0.24
Sage Weil [Wed, 15 Dec 2010 21:01:57 +0000 (13:01 -0800)]
v0.24

14 years agoosd: compensate for replicas with tail > last_complete
Sage Weil [Mon, 20 Dec 2010 21:22:49 +0000 (13:22 -0800)]
osd: compensate for replicas with tail > last_complete

Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590).  In that case, the primary
can compensate by sending more log info to the replica.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: make nested scatterlock state change check more robust
Sage Weil [Sat, 18 Dec 2010 05:02:58 +0000 (21:02 -0800)]
mds: make nested scatterlock state change check more robust

The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start).  (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)

We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() (which may
call scatter_start) case.

Fixes this crash:

mds/MDLog.h: In function 'void MDLog::start_entry(LogEvent*)':
mds/MDLog.h:191: FAILED assert(cur_event == __null)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (CInode::finish_scatter_update(ScatterLock*, CDir*, unsigned long, unsigned long)+0x804) [0x606e14]
 2: (CInode::start_scatter(ScatterLock*)+0xaa) [0x60dc1a]
 3: (Locker::scatter_mix(ScatterLock*, bool*)+0x1ca) [0x589a9a]
 4: (Locker::wrlock_start(SimpleLock*, MDRequest*, bool)+0x165) [0x597d65]
 5: (MDCache::predirty_journal_parents(Mutation*, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x153e) [0x55a70e]
 6: (Locker::scatter_writebehind(ScatterLock*)+0x42d) [0x58553d]
 7: (Locker::simple_lock(SimpleLock*, bool*)+0x7ab) [0x58beeb]
 8: (Locker::scatter_nudge(ScatterLock*, Context*, bool)+0x3ad) [0x58c49d]
 9: (Locker::scatter_tick()+0x28a) [0x58c98a]
 10: (MDS::tick()+0x4e4) [0x4b26a4]
 11: (SafeTimer::timer_thread()+0x22c) [0x6d164c]
 12: (SafeTimerThread::entry()+0xd) [0x6d34bd]
 13: (Thread::_entry_func(void*)+0xa) [0x4943da]
 14: /lib/libpthread.so.0 [0x7fc87810b73a]
 15: (clone()+0x6d) [0x7fc876dad69d]

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: make OpSequencer::flush() work for writeahead journaling items
Sage Weil [Fri, 17 Dec 2010 23:12:17 +0000 (15:12 -0800)]
filestore: make OpSequencer::flush() work for writeahead journaling items

It was only waiting for items in the op_queue to complete.  The goal is
to wait for anything we've called queue_transactions(&osr,...) on. If we
do writeahead journaling, though, there might be new ops that are still
journaling but not yet submitted to the fs that are missed.

This adds a journal queue to the OpSequencer, and uses it in the writeahead
case only.
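
The gap described above can be modeled with two queues. This is only a single-threaded sketch with hypothetical names (`OpSequencerSketch`, `jq_`, `q_`); the real OpSequencer blocks on a condition variable rather than exposing a predicate.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class OpSequencerSketch {
  std::deque<uint64_t> jq_;  // ops still being journaled (writeahead only)
  std::deque<uint64_t> q_;   // ops submitted to the fs, not yet applied

public:
  void queue_journal(uint64_t seq) { jq_.push_back(seq); }

  // Journal write finished: the op moves on to the fs queue.
  void journaled() {
    assert(!jq_.empty());
    q_.push_back(jq_.front());
    jq_.pop_front();
  }

  // The fs has applied the oldest queued op.
  void applied() {
    assert(!q_.empty());
    q_.pop_front();
  }

  // Old condition: only looks at the fs queue, so it misses ops that
  // are still journaling.
  bool old_flush_done() const { return q_.empty(); }

  // New condition: nothing may be pending in either queue.
  bool flush_done() const { return jq_.empty() && q_.empty(); }
};
```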

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: build_initial_monmap: fix mismatched alloc
Colin Patrick McCabe [Fri, 17 Dec 2010 23:31:41 +0000 (15:31 -0800)]
mon: build_initial_monmap: fix mismatched alloc

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agocommon: cleanups
Colin Patrick McCabe [Fri, 17 Dec 2010 23:06:40 +0000 (15:06 -0800)]
common: cleanups

common_init: avoid (mismatched) heap allocation

ConfFile::_parse: avoid memory leak on error path

ConfFile: NULL filename if not set, rather than leaving it undefined

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: PG::choose_acting: fix major iterator mistake
Colin Patrick McCabe [Fri, 17 Dec 2010 23:05:56 +0000 (15:05 -0800)]
osd: PG::choose_acting: fix major iterator mistake

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorgw: fix fd leak on error path
Colin Patrick McCabe [Fri, 17 Dec 2010 23:05:36 +0000 (15:05 -0800)]
rgw: fix fd leak on error path

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agohadoop: fix a bunch of mismatched allocations
Colin Patrick McCabe [Fri, 17 Dec 2010 23:04:17 +0000 (15:04 -0800)]
hadoop: fix a bunch of mismatched allocations

Using array new means you need array delete.
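
A minimal illustration of the rule (the function name is made up for the example and has nothing to do with the hadoop bindings themselves):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copies a C string through a temporary buffer allocated with new[].
inline std::string copy_via_array_new(const char* src) {
  std::size_t n = std::strlen(src) + 1;  // include the terminating NUL
  char* buf = new char[n];
  std::memcpy(buf, src, n);
  std::string out(buf);
  delete[] buf;  // matches new[]; a plain `delete buf` is undefined behavior
  return out;
}
```

In newer code a `std::vector<char>` or `std::unique_ptr<char[]>` would avoid the manual pairing altogether.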

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoauth: avoid mismatched allocation
Colin Patrick McCabe [Fri, 17 Dec 2010 23:03:37 +0000 (15:03 -0800)]
auth: avoid mismatched allocation

Can't pair strdup and delete; strdup'd memory must be released with free.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: flush pg writes to disk before starting scrub scan
Sage Weil [Fri, 17 Dec 2010 20:54:38 +0000 (12:54 -0800)]
osd: flush pg writes to disk before starting scrub scan

This avoids two races:
 - we just completed recovery by pushing objects to the replica, and the
   replica starts scanning before those writes reach the fs.
 - we just trimmed to something after last_update_applied.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: add per-sequencer flush operation
Sage Weil [Fri, 17 Dec 2010 20:51:19 +0000 (12:51 -0800)]
filestore: add per-sequencer flush operation

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: debug scan_list and scrub a bit better
Sage Weil [Fri, 17 Dec 2010 20:51:03 +0000 (12:51 -0800)]
osd: debug scan_list and scrub a bit better

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: clear INCONSISTENT if scrub detects no errors
Sage Weil [Fri, 17 Dec 2010 18:59:45 +0000 (10:59 -0800)]
osd: clear INCONSISTENT if scrub detects no errors

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: add assert that we're replica
Sage Weil [Fri, 17 Dec 2010 18:36:34 +0000 (10:36 -0800)]
osd: add assert that we're replica

Fred saw a crash where we got into merge_log as a stray, which really
shouldn't ever happen!  See #590.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agodebian: don't strip rados classes
Laszlo Boszormenyi [Fri, 17 Dec 2010 16:31:00 +0000 (08:31 -0800)]
debian: don't strip rados classes

Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agodebian: rename ceph.lintian -> ceph.lintian-overrides
Laszlo Boszormenyi [Fri, 17 Dec 2010 16:30:03 +0000 (08:30 -0800)]
debian: rename ceph.lintian -> ceph.lintian-overrides

Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoPG.cc:
Samuel Just [Thu, 16 Dec 2010 21:06:43 +0000 (13:06 -0800)]
PG.cc:
sub_op_scrub must set finalizing_scrub on the replica
before waiting for last_update_applied to catch up to
info.last_update.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG.cc:
Samuel Just [Wed, 15 Dec 2010 20:57:10 +0000 (12:57 -0800)]
ReplicatedPG.cc:
_scrub must set head when it encounters a head snap
curclone counts down, not up

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agofilestore: detect final version of async ioctl SNAP_CREATE_V2
Sage Weil [Sat, 11 Dec 2010 00:26:06 +0000 (16:26 -0800)]
filestore: detect final version of async ioctl SNAP_CREATE_V2

Li's revised interface for the async snap ioctl is more flexible.  Update
the ioctl call sites and detection code accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: Save straydn in mdr so it's consistent across retry attempts.
Greg Farnum [Wed, 15 Dec 2010 18:56:18 +0000 (10:56 -0800)]
mds: Save straydn in mdr so it's consistent across retry attempts.

Otherwise, we could choose new stray dirs and fail to get all
the locks we needed (while leaving old strays locked forever!).

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agomon: trim pgmap less aggressively
Sage Weil [Tue, 14 Dec 2010 17:26:12 +0000 (09:26 -0800)]
mon: trim pgmap less aggressively

This will make observer crashes due to missed states (#648) much harder to
hit.  Eventually the pgmap state trim problem will go away when the
monitor/paxos code is restructured (#647).

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agocrypto: catch cryptopp decrypt/encrypt exceptions
Yehuda Sadeh [Tue, 14 Dec 2010 18:51:08 +0000 (10:51 -0800)]
crypto: catch cryptopp decrypt/encrypt exceptions

14 years agoosd: PG::prior_set_affected: const cleanup
Colin Patrick McCabe [Tue, 14 Dec 2010 09:53:37 +0000 (01:53 -0800)]
osd: PG::prior_set_affected: const cleanup

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomds: fix replay/resent vs completed request check
Sage Weil [Sun, 12 Dec 2010 22:39:48 +0000 (14:39 -0800)]
mds: fix replay/resent vs completed request check

If it is a _replayed_ request, we should always send a simple ack if it is
completed, because the client does not care about any additional caps.

If it is a _resent_ request, then we want to return useful caps on open or
create requests, even if any modification side-effects have already been
committed.  The additional checks for completed already exist in the
create and open handlers.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agorpm: update changelog
Colin Patrick McCabe [Thu, 9 Dec 2010 22:38:08 +0000 (14:38 -0800)]
rpm: update changelog

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: fix ceph.spec to work with gcephtool
Colin Patrick McCabe [Thu, 9 Dec 2010 19:46:28 +0000 (11:46 -0800)]
rpm: fix ceph.spec to work with gcephtool

Don't try to package gui_resources unless we are building the GUI.
Get GUI dependencies correct.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoFix overflow in FileJournal::_open_file()
Vangelis Koukis [Thu, 9 Dec 2010 18:53:22 +0000 (20:53 +0200)]
Fix overflow in FileJournal::_open_file()

Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.

Relevant messages from the osd logfile:

2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal

The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().
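
The logged value is consistent with a size of 3000 MB being multiplied out in 32-bit signed arithmetic and then widened to 64 bits. That reading is an inference from the number in the log, not something the commit message states, and the function names below are hypothetical. The sketch reproduces the arithmetic without relying on signed overflow:

```cpp
#include <cstdint>

// Mimics a 32-bit computation of (size in MB) * 1 MiB that is then
// widened to a 64-bit unsigned byte count.
inline uint64_t buggy_journal_bytes(uint32_t size_mb) {
  uint32_t narrow = size_mb * 1048576u;  // 32-bit multiply (wraps mod 2^32)
  // Reinterpret as a signed 32-bit int. Implementation-defined before
  // C++20, but two's complement on all common platforms.
  int32_t as_signed = static_cast<int32_t>(narrow);
  // Widening a negative value sign-extends, producing a huge unsigned number.
  return static_cast<uint64_t>(static_cast<int64_t>(as_signed));
}

// Widen first, then scale: the product is computed in 64 bits.
inline uint64_t fixed_journal_bytes(uint32_t size_mb) {
  return static_cast<uint64_t>(size_mb) << 20;
}
```

For 3000 MB the first function yields 18446744072560312320, the figure in the log above, while the second yields 3145728000.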

Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Samuel Just [Thu, 9 Dec 2010 18:25:39 +0000 (10:25 -0800)]
ReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Cond is left in the mode.waiting_cond list.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: snap_trimmer now acquires a read lock on the osd map
Samuel Just [Thu, 9 Dec 2010 18:24:34 +0000 (10:24 -0800)]
ReplicatedPG: snap_trimmer now acquires a read lock on the osd map
before calling share_pg_info.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agorpm: don't try to package radosacl
Colin Patrick McCabe [Thu, 9 Dec 2010 18:59:57 +0000 (10:59 -0800)]
rpm: don't try to package radosacl

radosacl is just a test binary, so unless we build with --with-debug, we
won't get it.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: add pkgconfig to BuildRequires
Colin Patrick McCabe [Thu, 9 Dec 2010 18:39:34 +0000 (10:39 -0800)]
rpm: add pkgconfig to BuildRequires

You can't build without pkgconfig.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: set files-attr for radosgw
Colin Patrick McCabe [Thu, 9 Dec 2010 18:26:55 +0000 (10:26 -0800)]
rpm: set files-attr for radosgw

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agofilejournal: reset last_committed_seq if we find journal to be invalid
Sage Weil [Wed, 8 Dec 2010 23:53:13 +0000 (15:53 -0800)]
filejournal: reset last_committed_seq if we find journal to be invalid

If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal.  If that happens we also need to reset
last_committed_seq to avoid a crash like

2010-12-08 17:04:39.246950 7f269d138910 journal commit_finish thru 16904
2010-12-08 17:04:39.246961 7f269d138910 journal committed_thru 16904 < last_committed_seq 37778589
os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)':
os/FileJournal.cc:854: FAILED assert(seq >= last_committed_seq)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (FileJournal::committed_thru(unsigned long)+0xad) [0x588e7d]
 2: (JournalingObjectStore::commit_finish()+0x8c) [0x57f2ec]
 3: (FileStore::sync_entry()+0xcff) [0x5764cf]
 4: (FileStore::SyncThread::entry()+0xd) [0x506d9d]
 5: (Thread::_entry_func(void*)+0xa) [0x4790ba]
 6: /lib/libpthread.so.0 [0x7f26a2f8373a]
 7: (clone()+0x6d) [0x7f26a1c2569d]

Fixes #631
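
A small sketch of why the stale counter trips the assert. The struct is hypothetical, and resetting to 0 is an assumption; the commit message only says the value is reset.

```cpp
#include <cstdint>

struct FileJournalSketch {
  int64_t read_pos = 0;
  uint64_t last_committed_seq = 0;

  // Called when replay finds an entry later than expected and gives up
  // on the journal contents.
  void discard_invalid_journal() {
    read_pos = -1;
    last_committed_seq = 0;  // without this, the old value lingers
  }

  // Mirrors the condition in the failed assert: seq >= last_committed_seq.
  bool committed_thru_ok(uint64_t seq) const {
    return seq >= last_committed_seq;
  }
};
```

With the numbers from the log, a lingering `last_committed_seq` of 37778589 makes `committed_thru_ok(16904)` false; after the reset it is true.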

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: use helper for clock drift check; log relative instead of absolute time
Sage Weil [Wed, 8 Dec 2010 19:12:51 +0000 (11:12 -0800)]
mon: use helper for clock drift check; log relative instead of absolute time

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: sync->mix replica state is sync->mix(2)
Sage Weil [Tue, 7 Dec 2010 21:31:01 +0000 (13:31 -0800)]
mds: sync->mix replica state is sync->mix(2)

When auth first moves to sync->mix,
 - auth sends AC_MIX to replicas
 - replicas go to sync->mix
 - replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
 - auth gets all acks, sends AC_MIX again
 - replica moves to MIX

So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.
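
The replica side of the sequence above can be written as a tiny state machine. The enum and method names are illustrative and do not match the Locker code; only the transitions listed in the commit message are modeled.

```cpp
enum class LockState { SYNC, SYNC_MIX, SYNC_MIX2, MIX };

struct ReplicaLock {
  LockState state;
  bool sent_syncack = false;

  explicit ReplicaLock(LockState s) : state(s) {}

  void handle_ac_mix() {
    switch (state) {
      case LockState::SYNC:       // first AC_MIX: start gathering
        state = LockState::SYNC_MIX;
        break;
      case LockState::SYNC_MIX2:  // second AC_MIX: auth has all acks
        state = LockState::MIX;
        break;
      default:                    // other states are not modeled here
        break;
    }
  }

  void finish_gather() {
    if (state == LockState::SYNC_MIX) {
      sent_syncack = true;        // stands in for sending AC_SYNCACK
      state = LockState::SYNC_MIX2;
    }
  }
};
```

A replica created while the auth is mid-transition starts at `SYNC_MIX2`, so the one remaining AC_MIX it receives takes it straight to `MIX`.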

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: do not choose lock state on replicas
Sage Weil [Tue, 7 Dec 2010 20:50:15 +0000 (12:50 -0800)]
mds: do not choose lock state on replicas

The lock state has already been set during rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: small rejoin cleanup
Sage Weil [Tue, 7 Dec 2010 20:45:04 +0000 (12:45 -0800)]
mds: small rejoin cleanup

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: rev mds cluster internal protocol
Sage Weil [Tue, 7 Dec 2010 19:26:24 +0000 (11:26 -0800)]
mds: rev mds cluster internal protocol

The lock encoding changed with the dirty bit on scatterlocks.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: fix replay of already-journaled requests
Sage Weil [Tue, 7 Dec 2010 19:21:39 +0000 (11:21 -0800)]
mds: fix replay of already-journaled requests

Check for already-completed tids for both retried and replayed requests.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: open undef dirfrags during rejoin
Sage Weil [Tue, 7 Dec 2010 19:15:56 +0000 (11:15 -0800)]
mds: open undef dirfrags during rejoin

Any invented dirfrags have a version of 0.  This will cause problems later
if we pre_dirty() anything in that dir because the dir version won't be
in sync (it'll be way too small).  Also, we can do that at any point,
e.g. when flushing dirty caps, and aren't allowed to delay, so we need to
load those dirfrags now.

In theory we could read only the fnode and not all the dentries, but we
may as well.  We should be more careful about memory than this patch is,
though.

Fixes #15.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: add missing try_clear_more() to scatterlock
Sage Weil [Tue, 7 Dec 2010 18:47:58 +0000 (10:47 -0800)]
mds: add missing try_clear_more() to scatterlock

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: explicitly pass scatterlock dirty flag to auth on gather
Sage Weil [Tue, 7 Dec 2010 18:47:30 +0000 (10:47 -0800)]
mds: explicitly pass scatterlock dirty flag to auth on gather

This ensures that if the replica thinks it is flushing something the
auth will always do a scatter_writebehind.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: send LOCKFLUSHED to trigger finish_flush on replicas
Sage Weil [Tue, 7 Dec 2010 17:06:47 +0000 (09:06 -0800)]
mds: send LOCKFLUSHED to trigger finish_flush on replicas

Since f741766a we have triggered start_flush and finish_flush on replicas.
The problem is that the finish_flush didn't always happen for the mix->lock
case: we would start_flush when we sent the AC_LOCKACK, but could only
finish_flush if/when we got another SYNC or MIX.  If the primary stayed in
the LOCK state, we would keep our flushing flag.  That in turn causes
problems later when we try to eval_gather() (esp if we are auth at that
point?).

Fix this by sending an explicit AC_LOCKFLUSHED message to replicas after
we do a scatter_writebehind.  The replica will only set flushing if it
flushed dirty data, which forces scatter_writebehind, so we will always
get the LOCKFLUSHED to match.  Replicas that didn't flush will also get
it, but oh well.  We'd need to keep track which ones sent dirty data to
do that properly, though.
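
A sketch of the replica-side bookkeeping this describes, with hypothetical names; the real logic is spread across the scatterlock and Locker message handlers.

```cpp
struct ReplicaScatterLock {
  bool dirty = false;     // replica holds dirty scattered data
  bool flushing = false;  // set between start_flush and finish_flush

  // Stands in for sending AC_LOCKACK. Returns the dirty flag carried in
  // the message; flushing is only set if dirty data was actually flushed.
  bool send_lockack() {
    if (!dirty)
      return false;
    flushing = true;  // start_flush
    dirty = false;
    return true;
  }

  // Stands in for receiving AC_LOCKFLUSHED after the auth's
  // scatter_writebehind. Harmless on replicas that never flushed.
  void handle_lockflushed() {
    flushing = false;  // finish_flush
  }
};
```

Because the flag is cleared by an explicit message rather than by the next SYNC or MIX, it no longer lingers when the auth stays in the LOCK state.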

TODO: still need to verify that this is correct for rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: clear EXPORTINGCAPS on export_reverse
Sage Weil [Tue, 7 Dec 2010 15:58:01 +0000 (07:58 -0800)]
mds: clear EXPORTINGCAPS on export_reverse

We need to reverse the effects of encode_export_inode_caps(), which is just
the pin and state bit.

The original problem can be reproduced with
 - ceph tell mds 0 injectargs '--mds-kill-import-at 5'
 - restart mds
 - recovery completes successfully
 - wait for the subtree to be reexported
 - fail with bad EXPORTINGCAPS get in encode_export_inode_caps

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: fix LOOKUPHASH to avoid creating bogus replica CDir
Sage Weil [Tue, 7 Dec 2010 00:31:56 +0000 (16:31 -0800)]
mds: fix LOOKUPHASH to avoid creating bogus replica CDir

We can't create the CDir if we are non-auth.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: introduce rejoin_invent_dirfrag() helper
Sage Weil [Mon, 6 Dec 2010 22:34:36 +0000 (14:34 -0800)]
mds: introduce rejoin_invent_dirfrag() helper

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoautomake: in scripts, use sysconfdir as-is
Colin Patrick McCabe [Tue, 7 Dec 2010 18:56:05 +0000 (10:56 -0800)]
automake: in scripts, use sysconfdir as-is

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoautomake: in deb pkg, use --sysconfdir=/etc
Colin Patrick McCabe [Tue, 7 Dec 2010 18:48:19 +0000 (10:48 -0800)]
automake: in deb pkg, use --sysconfdir=/etc

When building the debian packages, use --sysconfdir=/etc.

Also, don't fudge sysconfdir in the init-ceph script.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomkcephfs: require -k; update man page
Sage Weil [Tue, 7 Dec 2010 06:17:47 +0000 (22:17 -0800)]
mkcephfs: require -k; update man page

Force users to specify keyring location; update man page accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoconfigure: detect crypto++ library
Yehuda Sadeh [Mon, 6 Dec 2010 23:25:08 +0000 (15:25 -0800)]
configure: detect crypto++ library