Sage Weil [Thu, 8 Mar 2012 22:29:42 +0000 (14:29 -0800)]
osd: add zero_to field to PG::OndiskLog; track zeroed region of pg log
Track which region of the log has been zeroed on disk. This may be
different from tail if 'osd preserved trimmed log = false' in the config.
Only zero the portion of the log we need to. This avoids rezeroing regions
or missing bits when 'osd preserved trimmed log' was off and is then turned
on.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
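The bookkeeping described in this commit can be sketched roughly as follows. This is a toy model in Python, not the PG::OndiskLog code: the class name, the `zeroed` list and the `zero_trimmed` flag are hypothetical, and how the config option maps onto that flag is deliberately left open here.

```python
class ToyOndiskLog:
    """Toy model of a log that remembers how far it has been zeroed."""

    def __init__(self):
        self.tail = 0      # offset of the oldest live log entry
        self.zero_to = 0   # everything before this offset is zeroed on disk
        self.zeroed = []   # (offset, length) ranges we asked the store to zero

    def trim(self, new_tail, zero_trimmed):
        """Advance tail; zero only the not-yet-zeroed part if requested."""
        self.tail = max(self.tail, new_tail)
        if zero_trimmed and self.zero_to < self.tail:
            self.zeroed.append((self.zero_to, self.tail - self.zero_to))
            self.zero_to = self.tail


log = ToyOndiskLog()
log.trim(100, zero_trimmed=False)  # zeroing off: tail moves, zero_to stays at 0
log.trim(250, zero_trimmed=True)   # zeroing on: catch up from 0, not from 100
log.trim(400, zero_trimmed=True)   # only the newly trimmed region is zeroed
print(log.zeroed)                  # [(0, 250), (250, 150)]
```

Because zero_to is tracked separately from tail, toggling the behaviour neither skips the region trimmed while zeroing was off nor zeroes any range twice.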
Sage Weil [Thu, 8 Mar 2012 22:30:06 +0000 (14:30 -0800)]
filestore: use FL_ALLOC_PUNCH_HOLE to zero, when available
First try the FL_ALLOC_PUNCH_HOLE fallocate() flag. If we get EOPNOTSUPP,
fall back to writing zeros.
Check for fallocate(2) with configure. Also, avoid this if we are not on
Linux, since I'm not sure the hard-coded FL_ALLOC_PUNCH_HOLE value is
correct on other platforms.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
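The try-then-fall-back pattern in this commit might look something like the sketch below. It is Python calling libc through ctypes, not the filestore code; the function names are hypothetical, the flag values are the ones from <linux/falloc.h> (where the flag is spelled FALLOC_FL_PUNCH_HOLE), and a 64-bit off_t is assumed.

```python
import ctypes
import ctypes.util
import errno
import os
import sys

# Values from <linux/falloc.h>; PUNCH_HOLE must be combined with KEEP_SIZE.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Deallocate a byte range via fallocate(2); raises OSError on failure."""
    if not sys.platform.startswith("linux"):
        # Mirror the commit's caution: do not trust the constants elsewhere.
        raise OSError(errno.EOPNOTSUPP, os.strerror(errno.EOPNOTSUPP))
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    # Assumption: off_t is 64 bits wide on this platform.
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def zero_range(fd, offset, length, punch=punch_hole):
    """Zero [offset, offset + length): punch a hole, else write zeros."""
    try:
        punch(fd, offset, length)
        return "punched"
    except OSError as e:
        if e.errno != errno.EOPNOTSUPP:
            raise                      # only EOPNOTSUPP triggers the fallback
    chunk = b"\0" * min(length, 1 << 20)
    done = 0
    while done < length:               # pwrite may write less than requested
        done += os.pwrite(fd, chunk[:length - done], offset + done)
    return "wrote zeros"
```

The `punch` parameter exists only so the fallback path can be exercised on any filesystem by passing a function that raises EOPNOTSUPP.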
Sage Weil [Wed, 7 Mar 2012 16:56:17 +0000 (08:56 -0800)]
osd: make degraded pgs count missing replicas as degraded objects
If a PG is smaller than it should be, make sure the missing replicas are
included in the degraded object count. This makes the overall degraded
percentage consistently meaningful even for PGs that aren't mid-recovery
or mid-backfill.
Fixes: #2137
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
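As a rough illustration of the accounting, the arithmetic could be sketched like this. The function names and the choice of denominator (objects times the target replica count) are assumptions for the example, not taken from the OSD code.

```python
def degraded_objects(num_objects, target_replicas, current_replicas,
                     missing_on_current=0):
    """Count degraded object copies, including whole missing replicas."""
    absent = max(target_replicas - current_replicas, 0)
    # Every object is degraded once for each replica that does not exist yet.
    return num_objects * absent + missing_on_current


def degraded_percent(num_objects, target_replicas, current_replicas,
                     missing_on_current=0):
    """Degraded copies as a percentage of all copies that should exist."""
    total_copies = num_objects * target_replicas
    if total_copies == 0:
        return 0.0
    degraded = degraded_objects(num_objects, target_replicas,
                                current_replicas, missing_on_current)
    return 100.0 * degraded / total_copies


# A PG with 100 objects that should have 3 replicas but currently has 2.
print(degraded_objects(100, 3, 2))   # 100
```

Under these assumptions a PG that is one replica short reports a third of its copies as degraded even when no recovery or backfill is in progress.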
Sage Weil [Tue, 6 Mar 2012 17:19:32 +0000 (09:19 -0800)]
filestore: create snap_0 on mkfs
If we create a new filestore, apply one transaction, and then crash, we
want to make sure we roll back to a consistent reference point--empty. The
simplest solution is to create that snap_0 during mkfs. This avoids
strangeness like
2012-02-27 00:42:00.336703 7fb1381ef780 filestore(/ceph/osd.0) mkfs in /ceph/osd.0
2012-02-27 00:42:00.341399 7fb1381ef780 journal _open /ceph/osd.0.journal fd 10: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-02-27 00:42:00.349705 7fb1381ef780 filestore(/ceph/osd.0) mkjournal created journal on /ceph/osd.0.journal
2012-02-27 00:42:00.349728 7fb1381ef780 filestore(/ceph/osd.0) mkfs done in /ceph/osd.0
2012-02-27 00:42:00.349787 7fb1381ef780 filestore(/ceph/osd.0) mount FIEMAP ioctl is NOT supported
2012-02-27 00:42:00.349800 7fb1381ef780 filestore(/ceph/osd.0) mount detected btrfs
2012-02-27 00:42:00.349813 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs CLONE_RANGE ioctl is supported
2012-02-27 00:42:00.357023 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs SNAP_CREATE is supported
2012-02-27 00:42:00.405174 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs SNAP_DESTROY is supported
2012-02-27 00:42:00.405214 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs START_SYNC got (25) Inappropriate ioctl for device
2012-02-27 00:42:00.405228 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs START_SYNC is NOT supported: (25) Inappropriate ioctl for device
2012-02-27 00:42:00.405235 7fb1381ef780 filestore(/ceph/osd.0) mount WARNING: btrfs snaps enabled, but no SNAP_CREATE_V2 ioctl (from kernel 2.6.37+)
2012-02-27 00:42:00.405561 7fb1381ef780 filestore(/ceph/osd.0) mount found snaps <>
2012-02-27 00:42:00.405576 7fb1381ef780 filestore(/ceph/osd.0) mount WARNING: no consistent snaps found, store may be in inconsistent state
and subsequent badness if we fail before a proper commit is made.
Fixes: #2105
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
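The effect of creating snap_0 at mkfs time can be shown with a small in-memory model. This is only a sketch: ToyStore and its methods are hypothetical stand-ins, and real snapshots are btrfs subvolume snapshots rather than dict copies.

```python
class ToyStore:
    """In-memory stand-in for a snapshot-capable object store."""

    def __init__(self):
        self.current = None   # live state; None until mkfs runs
        self.snaps = {}       # snap name -> copy of state at commit time

    def mkfs(self, create_snap_0=True):
        self.current = {}
        self.snaps = {}
        if create_snap_0:
            self.snaps["snap_0"] = {}   # consistent reference point: empty

    def apply(self, key, value):
        self.current[key] = value       # applied, but not yet committed

    def mount_after_crash(self):
        """Roll back to the newest snap; report whether one was found."""
        if not self.snaps:
            return False                # nothing consistent to roll back to
        newest = max(self.snaps, key=lambda name: int(name.split("_")[1]))
        self.current = dict(self.snaps[newest])
        return True


store = ToyStore()
store.mkfs()
store.apply("a", 1)                     # one transaction, then a crash
print(store.mount_after_crash(), store.current)   # True {}
```

Calling mkfs(create_snap_0=False) instead leaves mount_after_crash() with no snap to use, which corresponds to the "no consistent snaps found" warning in the log excerpt.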
Yehuda Sadeh [Fri, 2 Mar 2012 01:13:43 +0000 (17:13 -0800)]
rgw: basic functionality of new atomic get/put works
get/put of objects works. Stuff that is known to be broken:
- copy object
Also, going through the code, we can probably improve object
reading (use aio). We can also keep the manifest information on
the handle so that we don't need to get_obj_state every iteration.
Sage Weil [Mon, 5 Mar 2012 22:21:31 +0000 (14:21 -0800)]
osd: delay non-replayed ops during replay
If we get new (non-replayed) ops during replay, those need to wait until
after the replayed ops are ordered and applied. Otherwise we break the op
ordering completely, particularly with something like
- pg not active
- get op 1, put on waiting_for_active
- pg enters replay
- get op 2, apply immediately
- finish replay, requeue op 1
Fixes: #2082
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
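The sequence listed in this commit can be replayed in a toy simulation to show why the delay matters. The single `waiting` list and the function name are simplifications invented for this sketch; the OSD keeps several distinct wait queues.

```python
def run(delay_new_ops_during_replay):
    """Return the order in which ops are applied in the scenario above."""
    applied = []
    waiting = []                 # ops parked until the PG can take them

    # pg not active: op 1 arrives and is parked
    waiting.append("op1")

    # pg enters replay: op 2 (a new, non-replayed op) arrives
    if delay_new_ops_during_replay:
        waiting.append("op2")    # wait behind the earlier op
    else:
        applied.append("op2")    # applied immediately, ahead of op 1

    # replay finishes: requeue and apply everything that was waiting
    applied.extend(waiting)
    return applied


print(run(False))   # ['op2', 'op1']
print(run(True))    # ['op1', 'op2']
```

Without the delay the client's second op lands before its first; with it, arrival order is preserved.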
Sage Weil [Mon, 5 Mar 2012 22:21:00 +0000 (14:21 -0800)]
osd: don't trust pusher's data_complete
The pusher doesn't know what clone_overlap we'll see, so it has no idea
if we are data_complete from our perspective, making this check useless.
In particular, we screw up if we race with a recalculation of
clone_overlap.
Fixes: #2133
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Florian Haas [Sat, 3 Mar 2012 23:40:55 +0000 (00:40 +0100)]
OCF resource agents: add rbd
Add a resource agent for mapping, unmapping and monitoring RBD devices.
Maps an RBD on start, unmaps it on stop. Checks "rbd showmapped"
output to monitor whether the device is mapped, and thus does not
rely on the ceph-rbdnamer udev magic being enabled.
This RA is cloneable and essentially allows people to use RBD devices
as a drop-in replacement for
- iSCSI devices,
- host-based mirrored devices using md RAID-1,
- DRBD devices
in Pacemaker clusters.
Greg Farnum [Fri, 2 Mar 2012 22:46:06 +0000 (14:46 -0800)]
msgr: start re-ordering functions into a better order
This is the start of making the SimpleMessenger interface legible
to users. In addition to moving the configuration and accessor
functions to the top of the file, it adds virtual to the functions
which are part of the defined Messenger interface.
You can tell from some of the comments that work remains.