IoCtx::from_rados_ioctx_t creates an IoCtx out of a rados_ioctx_t.
However, this IoCtx must share ownership of the IoCtxImpl pointer with
the C API user who first called rados_ioctx_create. This must be done
via a reference count inside the IoCtxImpl.
Also add a copy constructor and assignment operator to class IoCtx,
since it's now cheap to have them.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
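
The shared-ownership scheme described above can be sketched roughly as
follows. This is a minimal illustration, not the actual librados code:
ImplSketch and HandleSketch are hypothetical stand-ins for IoCtxImpl and
IoCtx, and the plain int counter is an assumption made for brevity (a
real implementation would need an atomic counter to be thread-safe).

```cpp
#include <cassert>

// Hypothetical stand-in for IoCtxImpl: it carries its own reference count.
struct ImplSketch {
  int ref_cnt = 1;   // the creator (e.g. the C API caller) holds one reference
  static int live;   // number of impl objects currently alive (demo only)
  ImplSketch() { ++live; }
  ~ImplSketch() { --live; }
  void get() { ++ref_cnt; }
  void put() { if (--ref_cnt == 0) delete this; }
};
int ImplSketch::live = 0;

// Hypothetical stand-in for IoCtx: a handle sharing ownership of the impl.
class HandleSketch {
  ImplSketch *impl;
public:
  // Analogous to from_rados_ioctx_t: take an extra reference instead of
  // stealing the one held by the C API user.
  explicit HandleSketch(ImplSketch *p) : impl(p) { if (impl) impl->get(); }
  // Copying is cheap: it only bumps the count.
  HandleSketch(const HandleSketch &o) : impl(o.impl) { if (impl) impl->get(); }
  HandleSketch &operator=(const HandleSketch &o) {
    if (o.impl) o.impl->get();  // take the new reference first (self-assignment safe)
    if (impl) impl->put();      // then drop the old one
    impl = o.impl;
    return *this;
  }
  ~HandleSketch() { if (impl) impl->put(); }
  int refs() const { return impl ? impl->ref_cnt : 0; }
};
```

In this sketch the impl is destroyed only when both the original creator
and every handle have dropped their references.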
Samuel Just [Tue, 1 Mar 2011 01:07:57 +0000 (17:07 -0800)]
PG: unify scrub_received_maps and peer_scrub_maps
Previously, incoming maps were placed into peer_scrub_maps and merged
into scrub_received_maps during scrub_gather_replica_maps. Now,
sub_op_scrub_map merges the maps into scrub_received_maps directly.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Fri, 11 Feb 2011 22:46:05 +0000 (14:46 -0800)]
PG: refactor scrubmap comparison and repair logic
The previous version gave erroneous results. This version seems simpler
and can be more easily unit tested as the error detection logic has been
separated from the repair logic.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Thu, 10 Feb 2011 21:49:15 +0000 (13:49 -0800)]
PG: replica_scrub also should not block
As with scrub, replica scrub wait()ed for last_update_complete to catch
up to last_update. Now, it will requeue the message when that condition
is satisfied.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Thu, 10 Feb 2011 20:35:35 +0000 (12:35 -0800)]
PG: make scrub non-blocking
Previously, scrub would block using wait until both
1. last_update_applied==last_update and
2. all replica scrub maps are up-to-date
Condition 1 is now handled by requeueing scrub once last_update_applied
catches up to last_update (see op_applied and scrub).
Condition 2 is handled in scrub_finalize. scrub_finalize will be
scheduled using the scrub_finalize_wq once scrub_waiting_on hits 0 (see
scrub and sub_op_scrub_map).
scrub_finalize also handles comparing the maps and reporting/repairing
errors.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
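
The non-blocking flow described in this commit can be modeled roughly as
below. This is a single-threaded sketch under stated assumptions, not
the PG code: ScrubSketch, scrub_pending, pending_replicas and run_wq are
hypothetical names, and the work queue is reduced to a deque of
callbacks. Only last_update, last_update_applied, scrub_waiting_on and
the function names scrub, op_applied and sub_op_scrub_map come from the
commit message.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

struct ScrubSketch {
  int last_update = 0;
  int last_update_applied = 0;
  int scrub_waiting_on = 0;      // replica maps still outstanding
  bool scrub_pending = false;    // hypothetical: scrub deferred until writes apply
  int pending_replicas = 0;      // hypothetical: argument saved for the requeue
  bool finalized = false;
  std::deque<std::function<void()>> finalize_wq;  // stand-in for scrub_finalize_wq

  void scrub(int replicas) {
    if (last_update_applied != last_update) {
      // Instead of blocking, remember the request and return.
      scrub_pending = true;
      pending_replicas = replicas;
      return;
    }
    scrub_waiting_on = replicas;
    if (scrub_waiting_on == 0) queue_finalize();
  }

  void op_applied(int version) {
    last_update_applied = version;
    if (scrub_pending && last_update_applied == last_update) {
      scrub_pending = false;
      scrub(pending_replicas);   // requeue the deferred scrub
    }
  }

  void sub_op_scrub_map() {
    // A replica map arrived; finalize once none are outstanding.
    if (--scrub_waiting_on == 0) queue_finalize();
  }

  void queue_finalize() {
    finalize_wq.push_back([this] { finalized = true; });
  }

  void run_wq() {                // hypothetical: drains the queue like a worker thread
    while (!finalize_wq.empty()) {
      auto job = std::move(finalize_wq.front());
      finalize_wq.pop_front();
      job();
    }
  }
};
```

No call in the sketch waits: each step either makes progress or records
what must happen later and returns.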
Samuel Just [Thu, 10 Feb 2011 20:27:11 +0000 (12:27 -0800)]
OSD: add scrub_finalize_wq
Scrub currently blocks while waiting on replica maps and for
last_update_applied==last_update. Also, the subsequent checking of the
primary and replica maps is done while occupying a disk_tp slot. The
scrubmap checking will be moved into scrub_finalize and performed in
this wq in the op_tp.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
The headers and ceph_fs.cc are written such that they can be shared
verbatim between the kernel and userspace code. Omitting the headers
was deliberate, because they differ depending on the build environment.
The default file layout seems fine in config.cc, since it is declared
in config.h, and is a bunch of tunables we generally try to keep in
config.cc.
The previous change changed all PoolHandle uses to IoContext. This
change also renames the variables.
Also fix a few API functions whose names weren't quite right after the
previous change. rados_pool_list really does just list pools; it has
nothing to do with ioctxes.
rados_ioctx_change_auid should be rados_ioctx_pool_set_auid. Although it
takes an ioctx as an argument, it operates on the pool.
rados_ioctx_close should just return void. APIs where the close
operation can fail are broken. What is the user supposed to do if
closing doesn't work?
Also, fix a few test programs that got overlooked earlier.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Thu, 24 Feb 2011 20:31:58 +0000 (12:31 -0800)]
FileStore.h: reorder queue operations in _journaled_ahead
In writeahead mode, an op could disappear from jq without immediately
reappearing in q. Thus, q can be empty before seq is requeued and
finished. _journaled_ahead will now enqueue the op in q before removing
from jq.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
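
The ordering issue for writeahead mode can be illustrated with a small
single-threaded model. This is only a sketch: SequencerSketch, idle()
and the two journaled_ahead variants are hypothetical names, and the
"what an observer would see" check stands in for a concurrent flush that
inspects both queues between the two steps.

```cpp
#include <cassert>
#include <deque>

struct SequencerSketch {
  std::deque<int> jq;  // ops waiting on the journal
  std::deque<int> q;   // ops waiting to be applied

  bool idle() const { return jq.empty() && q.empty(); }
  void queue_journal(int seq) { jq.push_back(seq); }

  // Old order: remove from jq first, then add to q.
  // Returns what an observer between the two steps would conclude.
  bool journaled_ahead_old_order(int seq) {
    jq.pop_front();
    bool looked_idle = idle();  // both queues empty: op is briefly invisible
    q.push_back(seq);
    return looked_idle;
  }

  // New order: add to q first, then remove from jq.
  bool journaled_ahead(int seq) {
    q.push_back(seq);
    bool looked_idle = idle();  // op is visible in q (and still in jq)
    jq.pop_front();
    return looked_idle;
  }
};
```

With the new order the op is always present in at least one queue, so an
observer never mistakes an in-flight op for a finished one.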
This commit introduced an error in parallel journaling mode.
OpSequencer::flush is only meant to ensure that the ops have become
readable, not necessarily journalled.
Sage Weil [Wed, 23 Feb 2011 22:25:06 +0000 (14:25 -0800)]
mds: fix export cancellation vs nested freezes
Prevent freezes from completing while we are canceling exports. Otherwise
if we are freezing /a/b and /a, and cancel /a/b, we may inadvertently
complete the freeze on /a (synchronously) and confuse ourselves. Pin
all freezes beforehand so that when we cancel each one we do not cause
any others to prematurely complete.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Feb 2011 13:49:29 +0000 (05:49 -0800)]
Makefile: fix libatomic_ops linking
LDADD seems to have no effect on the final link command. Switching this
back to AM_LDFLAGS. This was changed in 1c7d8f1ac2c, although it's not
clear that the change was intentional...
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Feb 2011 08:34:37 +0000 (00:34 -0800)]
mds: remove "N stopped" from short mdsmap summary
It's confusing because it sounds like we're talking about daemons, when we
really just mean there are some ranks that created some ondisk state but
aren't currently part of the running cluster.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 23:08:58 +0000 (15:08 -0800)]
mds: strengthen assertions in rejoin ack
The ACK only contains items we asked for with a WEAK request. Assert as
much. (The old continue bits were from ~2007, when this was originally
written.)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Wed, 23 Feb 2011 21:55:43 +0000 (13:55 -0800)]
FileStore: fix OpSequencer::flush error
In writeahead mode, an op will disappear from jq without immediately
reappearing in q. Thus, q can be empty before seq is requeued and
finished. last_thru_q and last_thru_jq will now be tracked explicitly.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 21:34:01 +0000 (13:34 -0800)]
mon: fix dup mds takeover
Allow a standby to take over for a single MDS only by consistently looking
at the pending_mdsmap and not mdsmap. Mixing the two leads to all kinds
of confusion.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 21:01:08 +0000 (13:01 -0800)]
mds: refragment dirs when inode dirfragtree updates from journal
Force dir fragmentation specified by dirfragtree when replayed from
the journal.
Example:
mds0 is auth for /foo, mds1 is auth for /foo/bar.
mds1 fragments /foo/bar. journals etc.
mds0 gets fragment notify and the in-memory inode's dirfragtree changes.
mds0 journals the /foo/bar inode for some random reason.
mds0 imports /foo/bar.
On replay, mds0 refragments upon first mention of the new fragtree in the
journal, so that the dirfragtree <-> dir frags always match. Confusion is
avoided when we, say, import /foo/bar.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Split rados_init into rados_create and rados_connect. The pattern will
be for users to call create, set configuration, and then connect. Rename
rados_release to rados_destroy, to be more symmetrical with
rados_create. You can't reconnect after calling destroy.
Don't create the messenger inside the RadosClient constructor. Instead,
wait until RadosClient::connect().
Rename rados_conf_apply to rados_reopen_log. Add comment about SIGHUP.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
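
The create / configure / connect / destroy lifecycle described above can
be sketched with a mock class. This does not call librados: ClientSketch
and its methods are hypothetical, and rejecting configuration changes or
a second connect after connecting is an assumption of the sketch, not
something the commit message states. What it does mirror is that
construction acquires no messenger and that connect() is a separate,
later step.

```cpp
#include <cassert>
#include <map>
#include <string>

class ClientSketch {
  std::map<std::string, std::string> conf;
  bool messenger_created = false;
  bool connected = false;
public:
  // Like rados_create: cheap, no messenger is created here.
  ClientSketch() = default;

  // Set a configuration value before connecting.
  // Assumption: returns -1 if called after connect().
  int conf_set(const std::string &key, const std::string &value) {
    if (connected) return -1;
    conf[key] = value;
    return 0;
  }

  // Like rados_connect: the messenger is created only now.
  // Assumption: returns -1 if already connected.
  int connect() {
    if (connected) return -1;
    messenger_created = true;
    connected = true;
    return 0;
  }

  bool has_messenger() const { return messenger_created; }
  bool is_connected() const { return connected; }
  // Destruction plays the role of rados_destroy: the object cannot be
  // reconnected afterwards because it no longer exists.
};
```

Splitting construction from connection gives callers a window in which
configuration can be adjusted before any network state exists.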