Sage Weil [Fri, 11 Mar 2011 17:23:06 +0000 (09:23 -0800)]
osd: fix osdmap scanning on pg creation
On PG creation we were scanning the complete history of all osdmaps ever.
Fix initialization of PG::Info::History epoch_created and same_*_since
fields in the base (creation) case to make this work the way it was
supposed to.
Reported-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 21:06:08 +0000 (13:06 -0800)]
filestore: assert on ENOTEMPTY
ENOTEMPTY implies rmdir failed due to stray crap in the directory. We
should fail now, instead of later when we restart cosd and have stray
data sitting around.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 19:20:54 +0000 (11:20 -0800)]
filer: set RMW bit on probe
Setting the RMW bits on the probe stat call will make the OSD wait for
pending writes on the object to flush to disk. This was a problem for MDS
takeover: the old instance had writes mid-commit on the OSD when the
journal probe/stat came in, and it didn't "see" the size change because
the PG was in 'delayed' mode and the old MDS's write hadn't been queued
for disk locally. Simply claiming we are a RMW op will force the PG
into rmw mode so that we know the prior write will be visible. This is
just fine for the metadata workload.
Fixes: #805
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 18:17:24 +0000 (10:17 -0800)]
osd: fix peer no missing optimization
This shortcut was broken: we need to populate peer_missing with missing
objects in terms of the master log, not the peer's log (which may be old
or even divergent). This shortcut _only_ makes sense when the peer has
no missing in terms of a log that is perfectly up to date (i.e. matches
our last_update).
Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 17:54:53 +0000 (09:54 -0800)]
osd: fix missing.rm()
The version specifies which version of the object no longer should be
missing. We should thus remove it from the missing set if we needed
anything less than OR EQUAL to that version. (If we are missing something
newer, then missing.rm() is a no-op.)
Make the argument name less weird while we're at it.
Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
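The rule described for missing.rm() can be sketched roughly as follows. The Missing struct here is a simplified, hypothetical stand-in (object name mapped to the version still needed), not the real OSD type; only the "less than or equal" comparison is taken from the message above.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the OSD's missing set.
struct Missing {
  // object name -> version we still need
  std::map<std::string, unsigned> need;

  void add(const std::string& oid, unsigned v) { need[oid] = v; }

  // 'have' is the version of the object that should no longer be missing.
  // Drop the entry if what we needed is <= have. If we need something
  // newer, this is a no-op.
  void rm(const std::string& oid, unsigned have) {
    auto p = need.find(oid);
    if (p != need.end() && p->second <= have)
      need.erase(p);
  }

  bool is_missing(const std::string& oid) const {
    return need.count(oid) > 0;
  }
};
```

With need["a"] == 5, rm("a", 4) leaves the entry in place, while rm("a", 5) removes it.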
Greg Farnum [Tue, 8 Mar 2011 17:36:52 +0000 (09:36 -0800)]
uclient: Clear the CEPH_CAP_FILE_BUFFER ref on _flush, if safe.
Previously we just returned if safe, but leaving the CEPH_CAP_FILE_BUFFER
ref around breaks _fsync horribly. The root cause of this is
update_inode_file_bits calling objectcacher->truncate_set without
clearing the BUFFER ref, but the mechanics of clearing it there are
complicated, and I don't believe there are any issues with keeping
around the extra reference, as long as it's cleared when necessary.
Sage Weil [Tue, 8 Mar 2011 00:25:30 +0000 (16:25 -0800)]
mds: use projected subtree in rename anchor check
We want to (try to) reanchor the directory on rename when our _projected_
subtree is not a leaf. If we use the normal get_subtree_root() call,
we get NULL if we are unlinked, which makes is_leaf_subtree() crash.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 7 Mar 2011 19:32:20 +0000 (11:32 -0800)]
osd: include all stray peers in might_have_unfound
We should always consider any OSD that has a copy of the PG as a possible
location for missing objects. There are cases where might_have_unfound is
incomplete. For example,
- objects on [1,2]
- 2 marked down/out
- objects on [1,3]
- recovery completes, last_epoch_clean is set.
- 2 comes back online
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 4 Mar 2011 21:59:24 +0000 (13:59 -0800)]
osd: include all up peers in might_have_unfound when desperate
If our might_have_unfound calculation was off (it currently can be, see
#865) we could prematurely give up. Try any up OSD at this stage just to
be sure.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 4 Mar 2011 17:39:59 +0000 (09:39 -0800)]
osd: recover_primary if recover_replicas starts no ops
recover_replicas may fail to start anything if we see an unexpected error.
In that case, try recover_primary immediately instead of waiting for the
PG to (hopefully) get requeued for recovery later.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 4 Mar 2011 17:38:47 +0000 (09:38 -0800)]
osd: discover more missing if unfound and do_recovery can't start anything
If we couldn't start any recovery ops and things are still
unfound, see if we can discover more missing object locations.
It may be that our initial locations were bad and we errored
out while trying to pull.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
IoCtx::from_rados_ioctx_t creates an IoCtx out of a rados_ioctx_t.
However, this IoCtx must share ownership of the IoCtxImpl pointer with
the C API user who first called rados_ioctx_create. This must be done
via a reference count inside the IoCtxImpl.
Also add a copy constructor and assignment operator to class IoCtx,
since it's now cheap to have them.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
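A rough sketch of the shared-ownership scheme, under stated assumptions: the class names mirror the message above, but from_impl, get, put and refs are hypothetical, the count is a plain int (no thread safety), and none of this is the librados implementation.

```cpp
// Hypothetical, simplified sketch; not the librados code.
struct IoCtxImpl {
  int ref = 1;  // the creator (e.g. the C API user) holds the first ref

  void get() { ++ref; }

  // Drops one reference. Returns true if that was the last one and the
  // object has been deleted.
  bool put() {
    if (--ref == 0) {
      delete this;
      return true;
    }
    return false;
  }
};

class IoCtx {
  IoCtxImpl* impl = nullptr;

public:
  IoCtx() = default;

  // Share ownership with whoever already holds 'p'.
  static IoCtx from_impl(IoCtxImpl* p) {
    IoCtx io;
    io.impl = p;
    if (p) p->get();
    return io;
  }

  // Copying only bumps the count, which is why it is cheap.
  IoCtx(const IoCtx& o) : impl(o.impl) {
    if (impl) impl->get();
  }

  IoCtx& operator=(const IoCtx& o) {
    if (o.impl) o.impl->get();  // take the new ref first: safe on self-assignment
    if (impl) impl->put();
    impl = o.impl;
    return *this;
  }

  ~IoCtx() {
    if (impl) impl->put();
  }

  int refs() const { return impl ? impl->ref : 0; }
};
```

The original holder's reference survives every IoCtx going out of scope, so the C API user still has to release it.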
Log a version message whenever we open the dout log, not just the first
time. However, only output it to log files and syslog. Spewing versions
to stderr and stdout was determined to be annoying.
Rename dout_emergency_impl to dout_emergency_to_file_and_syslog to
better reflect its function.
Rename ceph_version_to_string to pretty_version_to_string.
Add get_process_name to do just that. Re-arrange some version.h methods.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Thu, 3 Mar 2011 00:13:54 +0000 (16:13 -0800)]
mds: rip out rename linkmerge support
It turns out POSIX says rename(a,b) is a no-op when a and b link to the
same inode. This is super weird but good news because it means we can
rip out a bunch of poorly tested code.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
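The POSIX behavior can be checked from userspace with a small sketch like the one below. The function name is hypothetical, and it assumes a local filesystem that supports hard links: after rename(a, b) on two links to the same inode, both names should still exist.

```cpp
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical test helper. Creates dir/a, hard-links it to dir/b, calls
// rename(a, b), and reports whether both names still refer to the same
// inode afterwards. Cleans up both names before returning.
bool rename_same_inode_keeps_both(const std::string& dir) {
  const std::string a = dir + "/a", b = dir + "/b";
  bool ok = false;
  int fd = ::open(a.c_str(), O_CREAT | O_WRONLY, 0600);
  if (fd >= 0) {
    ::close(fd);
    struct stat sa, sb;
    ok = ::link(a.c_str(), b.c_str()) == 0 &&
         std::rename(a.c_str(), b.c_str()) == 0 &&  // expected to be a no-op
         ::stat(a.c_str(), &sa) == 0 &&
         ::stat(b.c_str(), &sb) == 0 &&
         sa.st_ino == sb.st_ino;
  }
  ::unlink(a.c_str());
  ::unlink(b.c_str());
  return ok;
}
```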
Alexandre Oliva [Wed, 2 Mar 2011 21:39:09 +0000 (13:39 -0800)]
cmds/cosd: Fix IsHeapProfilerRunning implicit return type cast.
G++ complains about the difference between the return type of tcmalloc's
IsHeapProfilerRunning (int) and the return type of the function that
g_conf.profiler_running is supposed to point to (bool). We could
probably get away with a type-cast, but as a compiler developer and
former C++ language lawyer, I'd rather not take the risk of destroying
the universe by invoking undefined behavior ;-)
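One cast-free way to bridge the two types is a small wrapper with exactly the signature the pointer expects. This is only a sketch: FakeIsHeapProfilerRunning stands in for the tcmalloc function so the example builds without tcmalloc, and ConfSketch and profiler_running_wrapper are hypothetical names.

```cpp
// Stand-in with the same shape as tcmalloc's IsHeapProfilerRunning(),
// which returns int. Hypothetical stub, so no tcmalloc is needed here.
int FakeIsHeapProfilerRunning() { return 2; }

// Wrapper with the exact type the function pointer expects. The int is
// converted to bool explicitly instead of casting the function pointer.
bool profiler_running_wrapper() {
  return FakeIsHeapProfilerRunning() != 0;
}

// Hypothetical stand-in for the config object holding the pointer.
struct ConfSketch {
  bool (*profiler_running)() = nullptr;
};
```

Assigning profiler_running_wrapper to the pointer needs no cast, so the call goes through a function of the declared type.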
Sage Weil [Wed, 2 Mar 2011 00:02:48 +0000 (16:02 -0800)]
osd: update missing_loc when inferring an empty missing set
We infer an empty missing set, but weren't calculating object locations
based on that. Usually it was okay because we already had another
location, but not always! And especially not when one location turns out
to be bad and we need to go to another.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 1 Mar 2011 23:11:47 +0000 (15:11 -0800)]
osd: add object to missing if we find it missing on disk
If recovery finds the object missing on disk, add it
to the local missing set so we can (hopefully) recover it from another
replica.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 1 Mar 2011 00:05:08 +0000 (16:05 -0800)]
osd: trigger discover_all_missing after replay delay
We were calling discover_all_missing only when we went immediately active,
not after we were in the replay state (which triggers from a timer event
that calls OSD::activate_pg()). Move the call into PG::activate() so that
we catch both callers.
This requires passing in a query_map from the caller. While we're at it,
clean up some other instances where we are defining a new query_map
deep within the call tree.
Fixes: #847 (I hope)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
The headers and ceph_fs.cc are written such that they can be shared
verbatim between the kernel and userspace code. Omitting the headers
was deliberate, because they differ depending on the build environment.
The default file layout seems fine in config.cc, since it is declared
in config.h, and is a bunch of tunables we generally try to keep in
config.cc.
The previous change changed all PoolHandle uses to IoContext. This
change also renames the variables.
Also fix a few API functions whose names weren't quite right after the
previous change. rados_pool_list really does just list pools-- it has
nothing to do with ioctxes.
rados_ioctx_change_auid should be rados_ioctx_pool_set_auid. Although it
takes an ioctx as an argument, it operates on the pool.
rados_ioctx_close should just return void. APIs where the close
operation can fail are broken. What is the user supposed to do if
closing doesn't work?
Also, fix a few test programs that got overlooked earlier.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Thu, 24 Feb 2011 20:31:58 +0000 (12:31 -0800)]
FileStore.h: reorder queue operations in _journaled_ahead
In writeahead mode, an op could disappear from jq without immediately
reappearing in q. Thus, q can be empty before seq is requeued and
finished. _journaled_ahead will now enqueue the op in q before removing
from jq.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
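The reordering can be illustrated with a simplified sketch. OpQueuesSketch and its methods are hypothetical, single-threaded, and leave out the locking the real code needs; the point is only the order of the two queue operations.

```cpp
#include <deque>

// Hypothetical, single-threaded stand-in for the two per-sequencer queues.
struct OpQueuesSketch {
  std::deque<int> jq;  // ops waiting for the journal write
  std::deque<int> q;   // ops waiting to be applied

  void queue_for_journal(int seq) { jq.push_back(seq); }

  // Move the oldest journaled op over to q. Pushing to q before popping
  // from jq means the op is always present in at least one queue.
  void journaled_ahead() {
    if (jq.empty())
      return;
    int seq = jq.front();
    q.push_back(seq);  // enqueue first
    jq.pop_front();    // then remove
  }

  bool idle() const { return jq.empty() && q.empty(); }
};
```

With the opposite order there is a window in which idle() would report true while the op is still in flight.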
This commit introduced an error in parallel journaling mode.
OpSequencer::flush is only meant to ensure that the ops have become
readable, not necessarily journalled.