Sage Weil [Fri, 9 Mar 2012 21:34:55 +0000 (13:34 -0800)]
osd: fix watch_lock vs map_lock ordering
watch_lock is inside map_lock (and pg->lock), which means we need to
drop it to take pg->lock here. That means verifying in
handle_watch_timeout that we haven't raced with another thread canceling
the timeout event, which would be indicated by
- the entity not appearing in unconnected_watchers
- the entity having a different (presumably newer) expire time
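
The two checks above can be sketched roughly as follows. This is a minimal illustration, not the actual OSD code: WatchState, its members, and the handle_watch_timeout signature here are hypothetical stand-ins, and the caller is assumed to have dropped watch_lock before the call so the locks can be taken in the correct order.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-ins for illustration; these are not the real Ceph types.
using Clock = std::chrono::steady_clock;

struct WatchState {
  std::mutex pg_lock;     // outer lock (stands in for pg->lock)
  std::mutex watch_lock;  // inner lock (stands in for watch_lock)
  // entity name -> expire time of its currently scheduled timeout
  std::map<std::string, Clock::time_point> unconnected_watchers;
};

// Timer callback sketch. 'scheduled_expire' is the expire time this event was
// created with. Returns true only if the timeout is still valid and was handled.
bool handle_watch_timeout(WatchState& s, const std::string& entity,
                          Clock::time_point scheduled_expire) {
  // Take the locks in order: outer pg lock first, then watch_lock.
  std::lock_guard<std::mutex> pg(s.pg_lock);
  std::lock_guard<std::mutex> w(s.watch_lock);

  auto it = s.unconnected_watchers.find(entity);
  if (it == s.unconnected_watchers.end())
    return false;  // raced: another thread canceled the timeout
  if (it->second != scheduled_expire)
    return false;  // raced: the timeout was rescheduled with a different time

  s.unconnected_watchers.erase(it);  // still valid, so expire the watcher
  return true;
}
```

Because the state is re-validated only after both locks are held, a stale timer event becomes a no-op rather than expiring a watcher that has since reconnected or been rescheduled.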
Fixes: #2103 Signed-off-by: Sage Weil <sage.weil@dreamhost.com> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Fri, 9 Mar 2012 20:26:22 +0000 (12:26 -0800)]
osd: update_heartbeat_peers as needed
Before, we were being very careful about updating the heartbeat peers if
new PGs were created or when certain types of messages were received.
However, the PG can change its peers in lots of cases (e.g., when
recovery completes), but the OSD doesn't re-aggregate.
Instead, set a flag when each PG updates its set, and check that flag in
the OSD code periodically or in likely places. A call in tick() acts as
a catch-all.
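
A minimal sketch of the flag-and-recheck pattern described here. All names (OSDSketch, pg_update_peers, maybe_update_heartbeat_peers) are hypothetical and chosen for illustration; the real OSD and PG classes are considerably more involved.

```cpp
#include <atomic>
#include <map>
#include <set>
#include <utility>

// Hypothetical stand-in for the OSD; not the real Ceph class.
struct OSDSketch {
  std::atomic<bool> heartbeat_need_update{false};
  std::map<int, std::set<int>> pg_peers;  // pg id -> peer OSD ids
  std::set<int> heartbeat_peers;          // aggregate over all PGs
  int rebuilds = 0;                       // counts re-aggregations (for demonstration)

  // Called whenever a PG changes its peer set: record it and raise the flag.
  void pg_update_peers(int pgid, std::set<int> peers) {
    pg_peers[pgid] = std::move(peers);
    heartbeat_need_update = true;
  }

  // Cheap to call from "likely places": does nothing unless the flag is set.
  void maybe_update_heartbeat_peers() {
    if (!heartbeat_need_update.exchange(false))
      return;
    heartbeat_peers.clear();
    for (const auto& kv : pg_peers)
      heartbeat_peers.insert(kv.second.begin(), kv.second.end());
    ++rebuilds;
  }

  // Periodic catch-all.
  void tick() { maybe_update_heartbeat_peers(); }
};
```

The point of the flag is that PGs never need to know when the OSD re-aggregates; they only mark the aggregate as stale, and the next check rebuilds it once regardless of how many PGs changed.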
The num_created counts can probably be cleaned out now...
Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
Alex Elder [Thu, 8 Mar 2012 23:16:45 +0000 (15:16 -0800)]
ceph: document the way files are laid out
This adds a document that I wrote about how Ceph client file data
is striped across Ceph objects to the repository. It's a text
document. Someone with better document preparation skills than I
should use the content below as a basis for something prettier if
that's appropriate.
[Made a few edits... -sage]
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Thu, 8 Mar 2012 22:29:42 +0000 (14:29 -0800)]
osd: add zero_to field to PG::OndiskLog; track zeroed region of pg log
Track which region of the log has been zeroed on disk. This may be
different from tail if 'osd preserve trimmed log = false' in the config.
Only zero the portion of the log we need to. This avoids rezeroing regions
or missing bits when 'osd preserve trimmed log' was off and is then turned
on.
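
As a rough sketch of the bookkeeping, assuming byte offsets and a hypothetical trim() helper (the struct below is not the real PG::OndiskLog, and the zeroing_enabled parameter is an illustrative switch whose mapping to the config option is not shown):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the on-disk log bookkeeping.
struct OndiskLogSketch {
  std::uint64_t tail = 0;     // start of the live log
  std::uint64_t head = 0;     // end of the live log
  std::uint64_t zero_to = 0;  // everything before this offset is already zeroed

  // Advance tail and return the [begin, end) range that still needs zeroing.
  // An empty range (begin == end) means there is nothing to do.
  std::pair<std::uint64_t, std::uint64_t> trim(std::uint64_t new_tail,
                                               bool zeroing_enabled) {
    if (new_tail > tail)
      tail = new_tail;
    if (!zeroing_enabled || tail <= zero_to)
      return {zero_to, zero_to};  // zero_to lags behind tail, or is caught up
    std::uint64_t begin = zero_to;
    zero_to = tail;               // remember how far we have zeroed
    return {begin, tail};
  }
};
```

Because zero_to is tracked separately from tail, switching zeroing on after a period with it off zeroes exactly the backlog once, and repeated trims to the same tail zero nothing.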
Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Thu, 8 Mar 2012 22:30:06 +0000 (14:30 -0800)]
filestore: use FALLOC_FL_PUNCH_HOLE to zero, when available
First try the FALLOC_FL_PUNCH_HOLE fallocate() flag. If we get EOPNOTSUPP,
fall back to writing zeros.
Check for fallocate(2) with configure. Also, avoid this if we are not
Linux, since I'm not sure about the hard-coded FALLOC_FL_PUNCH_HOLE being
correct on other platforms.
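
A sketch of the try-then-fall-back approach, not the filestore code itself. The helper name zero_range is hypothetical. The Linux hole-punching flag is spelled FALLOC_FL_PUNCH_HOLE in <linux/falloc.h> and must be combined with FALLOC_FL_KEEP_SIZE. Treating ENOSYS like EOPNOTSUPP is an extra assumption of this sketch.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for fallocate() in <fcntl.h> on glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cerrno>
#include <cstring>
#if defined(__linux__)
#include <linux/falloc.h>
#endif

// Zero [offset, offset + len) in fd. Returns 0 on success, -errno on failure.
// Hypothetical helper for illustration.
int zero_range(int fd, off_t offset, off_t len) {
#if defined(__linux__) && defined(FALLOC_FL_PUNCH_HOLE)
  // First try to punch a hole; the range then reads back as zeros.
  int r = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);
  if (r == 0)
    return 0;
  if (errno != EOPNOTSUPP && errno != ENOSYS)
    return -errno;  // a real error, not "unsupported"
#endif
  // Fallback: write zeros explicitly.
  char buf[4096];
  std::memset(buf, 0, sizeof(buf));
  while (len > 0) {
    size_t n = static_cast<size_t>(std::min<off_t>(len, sizeof(buf)));
    ssize_t w = pwrite(fd, buf, n, offset);
    if (w < 0) {
      if (errno == EINTR)
        continue;
      return -errno;
    }
    offset += w;
    len -= w;
  }
  return 0;
}
```

One difference between the two paths: with FALLOC_FL_KEEP_SIZE the hole punch never extends the file, whereas the zero-writing fallback does if the range runs past end of file.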
Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 7 Mar 2012 16:56:17 +0000 (08:56 -0800)]
osd: make degraded pgs count missing replicas as degraded objects
If a PG is smaller than it should be, make sure the missing replicas are
included in the degraded object count. This makes the overall degraded
percentage consistently meaningful even for PGs that aren't mid-recovery
or mid-backfill.
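
One way to picture the accounting, as a hedged sketch: every object counts once as degraded for each replica the PG lacks entirely, plus whatever each existing replica is still missing. The function and parameter names are hypothetical, and the exact formula used by the OSD is not stated in this message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: count degraded object copies for one PG.
//   num_objects         - objects in the PG
//   target_replicas     - how many replicas the PG should have
//   missing_per_replica - one entry per replica that exists, giving the
//                         number of objects that replica is still missing
std::uint64_t degraded_objects(std::uint64_t num_objects,
                               std::size_t target_replicas,
                               const std::vector<std::uint64_t>& missing_per_replica) {
  std::uint64_t degraded = 0;
  std::size_t present = missing_per_replica.size();
  // A replica that does not exist at all is missing every object.
  if (present < target_replicas)
    degraded += num_objects * (target_replicas - present);
  // Existing replicas contribute only what they individually lack.
  for (std::uint64_t m : missing_per_replica)
    degraded += m;
  return degraded;
}
```

For example, a PG with 100 objects, a target of 3 replicas, and two existing replicas missing 0 and 5 objects would report 105 degraded copies out of 300 under this scheme.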
Fixes: #2137 Signed-off-by: Sage Weil <sage.weil@dreamhost.com> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Tue, 6 Mar 2012 17:19:32 +0000 (09:19 -0800)]
filestore: create snap_0 on mkfs
If we create a new filestore, apply one transaction, and then crash, we
want to make sure we roll back to a consistent reference point--empty. The
simplest solution is to create that snap_0 during mkfs. This avoids
strangeness like
2012-02-27 00:42:00.336703 7fb1381ef780 filestore(/ceph/osd.0) mkfs in /ceph/osd.0
2012-02-27 00:42:00.341399 7fb1381ef780 journal _open /ceph/osd.0.journal fd 10: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-02-27 00:42:00.349705 7fb1381ef780 filestore(/ceph/osd.0) mkjournal created journal on /ceph/osd.0.journal
2012-02-27 00:42:00.349728 7fb1381ef780 filestore(/ceph/osd.0) mkfs done in /ceph/osd.0
2012-02-27 00:42:00.349787 7fb1381ef780 filestore(/ceph/osd.0) mount FIEMAP ioctl is NOT supported
2012-02-27 00:42:00.349800 7fb1381ef780 filestore(/ceph/osd.0) mount detected btrfs
2012-02-27 00:42:00.349813 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs CLONE_RANGE ioctl is supported
2012-02-27 00:42:00.357023 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs SNAP_CREATE is supported
2012-02-27 00:42:00.405174 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs SNAP_DESTROY is supported
2012-02-27 00:42:00.405214 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs START_SYNC got (25) Inappropriate ioctl for device
2012-02-27 00:42:00.405228 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs START_SYNC is NOT supported: (25) Inappropriate ioctl for device
2012-02-27 00:42:00.405235 7fb1381ef780 filestore(/ceph/osd.0) mount WARNING: btrfs snaps enabled, but no SNAP_CREATE_V2 ioctl (from kernel 2.6.37+)
2012-02-27 00:42:00.405561 7fb1381ef780 filestore(/ceph/osd.0) mount found snaps <>
2012-02-27 00:42:00.405576 7fb1381ef780 filestore(/ceph/osd.0) mount WARNING: no consistent snaps found, store may be in inconsistent state
and subsequent badness if we fail before a proper commit is made.
Fixes: #2105 Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Yehuda Sadeh [Fri, 2 Mar 2012 01:13:43 +0000 (17:13 -0800)]
rgw: basic functionality of new atomic get/put works
get/put of objects works. Stuff that is known to be broken:
- copy object
Also, going through the code, we can probably improve object
reading (use aio). We can also keep the manifest information on
the handle so that we don't need to get_obj_state every iteration.