Sage Weil [Fri, 30 Mar 2012 16:51:45 +0000 (09:51 -0700)]
filestore: set guard on collection_move
During recovery we submit transactions like:
- delete a/foo
- move tmp/foo to a/foo
This prevents the EEXIST check in collection_move from doing any good,
since the destination never exists. We need to do that remove at least
sometimes, because we may be overwriting an existing/older version of the
object.
So,
- set the guard after we do the move, so that
  - the delete won't be repeated, and
  - the EEXIST check will work
Also check the guard for good measure (although that doesn't do anything
specifically useful in this scenario).
Fixes: #2164
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
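The replay behaviour described in this commit can be sketched with a toy in-memory store. Everything below (ToyStore, remove_and_move, the integer sequence number used as a guard) is hypothetical and only models the idea; it is not the FileStore code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy store: path -> (contents, guard sequence number).
struct ToyStore {
  struct Obj {
    std::string data;
    long guard = 0;  // sequence of the last guarded transaction applied
  };
  std::map<std::string, Obj> objs;

  // Apply "delete dst; move src to dst" as transaction number seq.
  // Returns false if the guard shows the transaction was already applied.
  // Assumes src != dst.
  bool remove_and_move(long seq, const std::string& src,
                       const std::string& dst) {
    auto d = objs.find(dst);
    if (d != objs.end() && d->second.guard >= seq)
      return false;            // replay: skip both the delete and the move
    objs.erase(dst);           // may be replacing an older version
    auto s = objs.find(src);
    assert(s != objs.end());   // on first application the source must exist
    Obj moved = s->second;
    moved.guard = seq;         // set the guard after the move
    objs.erase(s);
    objs[dst] = moved;
    return true;
  }
};
```

With the guard recorded on the destination, replaying the same transaction leaves a/foo alone instead of deleting it and then failing to find tmp/foo.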
Alexandre Oliva [Thu, 22 Mar 2012 19:23:02 +0000 (16:23 -0300)]
don't override CFLAGS
leveldb adds -I flags to CFLAGS and CXXFLAGS, but if these macros are
overridden in the make command line, the flags are dropped, and the
build fails. leveldb should probably use AM_CFLAGS instead, but the
spec file can specify the preferred CFLAGS in the configure command
line, and then everything will work as expected.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 18 Mar 2012 16:08:15 +0000 (09:08 -0700)]
osd: fix object_info.size vs. file size mismatch due to truncate_seq on new object
If the first write that creates an object includes a truncate_seq and
truncate_size, we were taking the truncate path and doing a truncate op
in our transaction prior to the write, and then setting the object_info
size appropriately. However, if the object doesn't exist, the truncate
op fails even though the oi.size gets set.
Later, this turns up as a scrub error (see #2080).
Fix this by skipping the truncate if it is a new object. Instead, we
should just initialize our truncate_{seq,size} metadata so that we're all
up to date for any later writes.
Alternatively, we could touch the object and then truncate it (up) to the
large size, but this is sort of a waste; data beyond a short object eof is
defined to be zeros, so all we would accomplish is making recovery work
harder by copying zeros around.
Fixes: #2080
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
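A minimal sketch of the fix, assuming a simplified model: ObjectInfo and plan_write are hypothetical stand-ins, not the OSD's real types, and the handling of existing objects is reduced to "truncate down if the object is larger".

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified per-object metadata.
struct ObjectInfo {
  bool exists = false;
  uint64_t size = 0;
  uint64_t truncate_seq = 0;
  uint64_t truncate_size = 0;
};

// Returns the names of the store ops queued for a write that carries
// truncate_seq/truncate_size, and updates oi to match.
std::vector<std::string> plan_write(ObjectInfo& oi, uint64_t off, uint64_t len,
                                    uint64_t truncate_seq,
                                    uint64_t truncate_size) {
  std::vector<std::string> ops;
  if (truncate_seq > oi.truncate_seq) {
    // Only truncate objects that already exist; on a new object the op
    // would fail while oi.size was still being updated.
    if (oi.exists && oi.size > truncate_size) {
      ops.push_back("truncate");
      oi.size = truncate_size;
    }
    // New or existing, record the metadata so later writes are up to date.
    oi.truncate_seq = truncate_seq;
    oi.truncate_size = truncate_size;
  }
  ops.push_back("write");
  oi.exists = true;
  if (off + len > oi.size)
    oi.size = off + len;
  return ops;
}
```

In this model a first write of 10 bytes with truncate_size 4096 queues only the write and leaves oi.size at 10, which matches the file on disk.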
Sage Weil [Fri, 16 Mar 2012 20:07:25 +0000 (13:07 -0700)]
osd: explicitly create new object,snap contexts on push
We specifically want to use this during recovery to avoid loading the obc
or ssc for a previous version of the object and populating the watchers.
We know we won't have any existing obc here because the object is missing
(old or nonexistent).
For the snapset context, we provide it explicitly when we recover the head
or snapset object (which we always do first). For clones, we re-use the
existing get_snapset_context(), which will either have the ssc open or
can load it from the head/snapset object.
Sage Weil [Thu, 15 Mar 2012 17:35:40 +0000 (10:35 -0700)]
osd: maybe clear DEGRADED on recovery completion
We set degraded if we don't have enough "active" replicas, which excludes
the backfill target. We need to recheck that when we finish recovery and
the backfill target is now complete.
Fixes: #2160
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Sage Weil [Wed, 14 Mar 2012 19:14:20 +0000 (12:14 -0700)]
osd: rev cluster internal protocol
This covers:
- the push/pull changes in 0.43 (which we forgot to protect against; see
  #2132)
- the new omap stuff for 0.44
Maybe we could make this finer grained so that ceph-osd would fail only
when mismatched versions are talking _and_ there is actual omap data in
play, but it's not worth the effort at this point.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Mon, 12 Mar 2012 20:33:55 +0000 (13:33 -0700)]
FileStore: ignore ERANGE and ENOENT on replay
The source object may either not exist or be the wrong size
during replay if the destination object was deleted in a future
already-applied operation. This should not impact correctness
of the replay.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
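As a rough illustration, the error filtering could look like the helper below. The name ignorable_on_replay and its signature are hypothetical; the only facts taken from the commit are the two error codes and that they are tolerated during replay only.

```cpp
#include <cerrno>

// Hypothetical helper: r is a negative errno-style result from an op that
// reads a source object (for example a clone). During replay, ENOENT and
// ERANGE are tolerated because a later, already-applied operation may have
// removed or resized the objects involved.
bool ignorable_on_replay(int r, bool replaying) {
  if (r >= 0)
    return true;    // success is always fine
  if (!replaying)
    return false;   // outside replay every error is real
  return r == -ENOENT || r == -ERANGE;
}
```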
Yehuda Sadeh [Mon, 12 Mar 2012 20:15:50 +0000 (13:15 -0700)]
config: alternative config options for global_init()
We want to be able to provide default config values other than the ones
we set in common/config_opts.h. This can be useful when we want different
defaults for different modules (e.g., rgw, rgw-admin).
Just passing it on the command line won't do because then we'd override
any config set by the user, so we need to process that before the regular
parsing (but after initializing the config context).
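The ordering argument can be sketched as three layers applied in sequence. This is a toy model: Config, build_config, and the option name used in the usage note are hypothetical and unrelated to the real global_init() signature.

```cpp
#include <map>
#include <string>

using Config = std::map<std::string, std::string>;

// Layer the settings so that later layers win:
//   1. compiled-in defaults
//   2. per-module alternative defaults
//   3. whatever the user set (config file, command line)
Config build_config(const Config& compiled_defaults,
                    const Config& module_defaults,
                    const Config& user_settings) {
  Config conf = compiled_defaults;
  for (const auto& kv : module_defaults)
    conf[kv.first] = kv.second;   // overrides the compiled-in default
  for (const auto& kv : user_settings)
    conf[kv.first] = kv.second;   // the user always has the last word
  return conf;
}
```

For a made-up option example_opt with compiled-in value "1" and module default "5", the result is "5" unless the user sets it, in which case the user's value is kept. Passing the module default on the command line would instead put it in the last layer and clobber the user's choice.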
Yehuda Sadeh [Mon, 12 Mar 2012 18:39:58 +0000 (11:39 -0700)]
rgw: switch ops log flag to use ceph config
It's turned on by default. So now we're using the
'rgw enable ops log' config param in ceph.conf, instead
of RGW_SHOULD_LOG_DEFAULT in the apache conf.
Sage Weil [Mon, 12 Mar 2012 04:11:37 +0000 (21:11 -0700)]
Makefile: link libfcgi to librgw
Need this to make a linker error go away on my squeeze dev box. We
probably need to make sure librgw doesn't touch fcgi, once that is
revisited down the line. Opened #2166.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 10 Mar 2012 00:34:55 +0000 (16:34 -0800)]
filestore: sync object_map on _set_replay_guard()
We need to sync the object_map too. We can _almost_ check to see if there
are keys for the object and only do it then, except that they may have
existed previously and then been deleted.
So, always sync. leveldb is reasonably nice about this... it should just
be another fsync.
Sage Weil [Thu, 8 Mar 2012 04:58:27 +0000 (20:58 -0800)]
filestore: remove old post-idempotent transaction trigger_commit
The old strategy was to initiate a commit after any non-idempotent
transaction. This only worked if the transaction was idempotent with
respect to itself, or could be replayed partially without problems,
and in reality that isn't the case. For example:
- clone A -> B
- write to A
- <sync>
If we crash before the sync, and replay the clone A->B, we corrupt B with
the new A data.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 7 Mar 2012 18:11:58 +0000 (10:11 -0800)]
filestore: fgetxattr helpers/wrappers
Also, do the getxattr using fgetxattr, to avoid duplicating code. This is
slightly slower probably because we open a file handle, but if we care we
should really clean up the code to use lfn_open instead of lfn_find and
avoid the repeated path traversal too.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 9 Mar 2012 21:34:55 +0000 (13:34 -0800)]
osd: fix watch_lock vs map_lock ordering
watch_lock is inside map_lock (and pg->lock), which means we need to
drop it to take pg->lock here. That means verifying in
handle_watch_timeout that we haven't raced with another thread canceling
the timeout event, which would be indicated by
- the entity not appearing in unconnected_watchers
- the entity having a different (presumably newer) expire time
Fixes: #2103
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
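The two race checks listed above amount to re-validating the timeout once the locks have been re-taken. A minimal sketch, assuming a hypothetical map from entity id to expire time (the real unconnected_watchers structure and its key types may differ):

```cpp
#include <map>

// Hypothetical stand-in: entity id -> expire time of its pending timeout.
using UnconnectedWatchers = std::map<int, long>;

// Called after dropping watch_lock and taking pg->lock. Returns true only if
// the timeout that fired is still the one on record for this entity.
bool timeout_still_valid(const UnconnectedWatchers& unconnected,
                         int entity, long expire_when_scheduled) {
  auto it = unconnected.find(entity);
  if (it == unconnected.end())
    return false;   // entity is gone: the timeout was canceled
  // A different (presumably newer) expire time means it was rescheduled.
  return it->second == expire_when_scheduled;
}
```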
Sage Weil [Fri, 9 Mar 2012 20:26:22 +0000 (12:26 -0800)]
osd: update_heartbeat_peers as needed
Before, we were being very careful about updating the heartbeat peers if
new PGs were created or when certain types of messages were received.
However, the PG can change its peers in lots of cases (e.g., when
recovery completes), but the OSD doesn't re-aggregate.
Instead, set a flag when each PG updates its set, and check that flag in
the OSD code periodically or in likely places. A call in tick() acts as
a catch-all.
The num_created counts can probably be cleaned out now...
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
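The flag-based approach is a plain dirty-flag pattern. The sketch below is a guess at its shape only: ToyPG, ToyOSD, and maybe_update_heartbeat_peers are hypothetical names, and locking is left out.

```cpp
#include <set>
#include <vector>

struct ToyPG {
  std::set<int> peers;
  bool peers_changed = false;

  // Record a new peer set and raise the flag only on an actual change.
  void set_peers(const std::set<int>& p) {
    if (p != peers) {
      peers = p;
      peers_changed = true;
    }
  }
};

struct ToyOSD {
  std::vector<ToyPG> pgs;
  std::set<int> heartbeat_peers;

  // Cheap to call from tick() or other likely places: it re-aggregates only
  // when some PG flagged a change. Returns true if it did the work.
  bool maybe_update_heartbeat_peers() {
    bool dirty = false;
    for (const auto& pg : pgs)
      dirty = dirty || pg.peers_changed;
    if (!dirty)
      return false;
    heartbeat_peers.clear();
    for (auto& pg : pgs) {
      heartbeat_peers.insert(pg.peers.begin(), pg.peers.end());
      pg.peers_changed = false;
    }
    return true;
  }
};
```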
Alex Elder [Thu, 8 Mar 2012 23:16:45 +0000 (15:16 -0800)]
ceph: document the way files are laid out
This adds a document that I wrote about how Ceph client file data
is striped across Ceph objects to the repository. It's a text
document. Someone with better document preparation skills than I
should use the content below as a basis for something prettier if
that's appropriate.
[Made a few edits... -sage]
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Thu, 8 Mar 2012 22:29:42 +0000 (14:29 -0800)]
osd: add zero_to field to PG::OndiskLog; track zeroed region of pg log
Track which region of the log has been zeroed on disk. This may be
different from tail if 'osd preserve trimmed log = false' in the config.
Only zero the portion of the log we need to. This avoids rezeroing regions
or missing bits when 'osd preserve trimmed log' was off and is then turned
on.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Thu, 8 Mar 2012 22:30:06 +0000 (14:30 -0800)]
filestore: use FALLOC_FL_PUNCH_HOLE to zero, when available
First try the FALLOC_FL_PUNCH_HOLE fallocate() flag. If we get EOPNOTSUPP,
fall back to writing zeros.
Check for fallocate(2) with configure. Also, avoid this if we are not on
Linux, since I'm not sure about the hard-coded FALLOC_FL_PUNCH_HOLE value
being correct on other platforms.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
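A minimal sketch of the try-then-fall-back logic, not the FileStore code. The Linux headers spell the flag FALLOC_FL_PUNCH_HOLE, and it has to be combined with FALLOC_FL_KEEP_SIZE. The function name zero_range, the 64 KB buffer, and the extra ENOSYS fallback are assumptions made for this example. Note one difference between the two paths: punching a hole never extends the file, while the write fallback will if the range runs past EOF.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cerrno>
#include <vector>
#ifdef __linux__
#include <linux/falloc.h>
#endif

// Zero the byte range [off, off+len) of fd. Returns 0 or a negative errno.
int zero_range(int fd, off_t off, off_t len) {
#if defined(__linux__) && defined(FALLOC_FL_PUNCH_HOLE)
  // Preferred path: deallocate the range; it then reads back as zeros.
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len) == 0)
    return 0;
  // EOPNOTSUPP means the filesystem cannot punch holes. ENOSYS (kernel has
  // no fallocate) is also treated as unsupported; that is an addition here.
  if (errno != EOPNOTSUPP && errno != ENOSYS)
    return -errno;
#endif
  // Fallback: write zeros explicitly, one chunk at a time.
  std::vector<char> zeros(64 * 1024, 0);
  while (len > 0) {
    size_t n = static_cast<size_t>(
        std::min<off_t>(len, static_cast<off_t>(zeros.size())));
    ssize_t w = pwrite(fd, zeros.data(), n, off);
    if (w < 0) {
      if (errno == EINTR)
        continue;
      return -errno;
    }
    off += w;
    len -= w;
  }
  return 0;
}
```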