Sage Weil [Tue, 6 Mar 2012 17:19:32 +0000 (09:19 -0800)]
filestore: create snap_0 on mkfs
If we create a new filestore, apply one transaction, and then crash, we
want to make sure roll back to a consistent reference point--empty. The
simplest solution is to create that snap_0 during mkfs. This avoids
strangeness like
2012-02-27 00:42:00.336703 7fb1381ef780 filestore(/ceph/osd.0) mkfs in /ceph/osd.0
2012-02-27 00:42:00.341399 7fb1381ef780 journal _open /ceph/osd.0.journal fd 10: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-02-27 00:42:00.349705 7fb1381ef780 filestore(/ceph/osd.0) mkjournal created journal on /ceph/osd.0.journal
2012-02-27 00:42:00.349728 7fb1381ef780 filestore(/ceph/osd.0) mkfs done in /ceph/osd.0
2012-02-27 00:42:00.349787 7fb1381ef780 filestore(/ceph/osd.0) mount FIEMAP ioctl is NOT supported
2012-02-27 00:42:00.349800 7fb1381ef780 filestore(/ceph/osd.0) mount detected btrfs
2012-02-27 00:42:00.349813 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs CLONE_RANGE ioctl is supported
2012-02-27 00:42:00.357023 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs SNAP_CREATE is supported
2012-02-27 00:42:00.405174 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs SNAP_DESTROY is supported
2012-02-27 00:42:00.405214 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs START_SYNC got (25) Inappropriate ioctl for device
2012-02-27 00:42:00.405228 7fb1381ef780 filestore(/ceph/osd.0) mount btrfs START_SYNC is NOT supported: (25) Inappropriate ioctl for device
2012-02-27 00:42:00.405235 7fb1381ef780 filestore(/ceph/osd.0) mount WARNING: btrfs snaps enabled, but no SNAP_CREATE_V2 ioctl (from kernel 2.6.37+)
2012-02-27 00:42:00.405561 7fb1381ef780 filestore(/ceph/osd.0) mount found snaps <>
2012-02-27 00:42:00.405576 7fb1381ef780 filestore(/ceph/osd.0) mount WARNING: no consistent snaps found, store may be in inconsistent state
and subsequent badness if we fail before a proper commit is made.
Fixes: #2105 Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 5 Mar 2012 22:21:31 +0000 (14:21 -0800)]
osd: delay non-replayed ops during replay
If we get new (non-replayed) ops during replay, those need to wait until
after the replayed ops are ordered and applied. Otherwise we break the op
ordering completely, particularly with something like
- pg not active
- get op 1, put on waiting_for_active
- pg enters replay
- get op 2, apply immediately
- finish replay, requeue op 1
Fixes: #2082 Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Mon, 5 Mar 2012 22:21:00 +0000 (14:21 -0800)]
osd: don't trust pusher's data_complete
The pusher doesn't know what clone_overlap we'll see, so it has no idea
if we are data_complete from our perspective, making this check useless.
In particular, we screw up if we race with a recalculation of
clone_overlap.
Fixes: #2133 Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Florian Haas [Sat, 3 Mar 2012 23:40:55 +0000 (00:40 +0100)]
OCF resource agents: add rbd
Add a resource agent for mapping, unmapping and monitoring RBD devices.
Maps an RBD on start, unmaps it on stop. Checks "rbd showmapped"
output for monitoring whether the device is mapped, thus does not
rely on the ceph-rbdnamer udev magic to be enabled.
This RA is cloneable and essentially allows people to use RBD devices
as a drop-in replacement for
- iSCSI devices,
- host-based mirrored devices using md RAID-1,
- DRBD devices
in Pacemaker clusters.
Greg Farnum [Fri, 2 Mar 2012 22:46:06 +0000 (14:46 -0800)]
msgr: start re-ordering functions into a better order
This is the start of making the SimpleMessenger interface legible
to users. In addition to moving the configuration and accessor
functions to the top of the file, it adds virtual to the functions
which are part of the defined Messenger interface.
You can tell from some of the comments that work remains.
Greg Farnum [Thu, 1 Mar 2012 23:14:52 +0000 (15:14 -0800)]
msgr: Remove SimpleMessenger::register_entity
This function has been vestigial for a long time. Remove it and move
its remaining functionality into the constructor.
Update users to the new interface (this is remarkably easy and
simplifies the code).
Sage Weil [Fri, 2 Mar 2012 17:44:04 +0000 (09:44 -0800)]
filestore: fix rollback safety check
There is a window in the old check between when current/commit_op_seq is
written and the snapshot is taken. If ceph-osd crashes, we'll be unable to
start because we'll believe current/ was in use without proper checkpoints.
Instead, make the snapped/not snapped state of current/ explicit.
Fixes: #2118 Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Samuel Just <samuel.just@dreamhost.com> Reviewed-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Josh Durgin [Tue, 7 Feb 2012 01:59:36 +0000 (17:59 -0800)]
librados: only shutdown objecter after it's initialized
The objecter is only initialized once the RadosClient state is
CONNECTED from the perspective of a RadosClient::shutdown()
caller. Error paths in RadosClient::connect() may call shutdown while
still in the CONNECTING state.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Yehuda Sadeh [Wed, 29 Feb 2012 19:34:33 +0000 (11:34 -0800)]
rgw: don't retry certain operations if we raced
The atomic get/put scheme was retrying writes in case where it lost
races (head object was rewritten by another client). Instead we can
just back off and return success.
Sage Weil [Wed, 29 Feb 2012 21:22:34 +0000 (13:22 -0800)]
msgr: fix race in learned_addr()
- two connect() threads
- both hit if (need_addr) check
- one takes lock, sets addr, need_addr = false, unlocks
- continues to ::encode(ms_addr, ...);
- meanwhile, second thread set ms_addr _again_, but copies peer port into
place before adjusting it. racing ::encode() sees bad port and sends it
to the peer.
Fix this two ways:
- don't copy bad port into place; set it first
- re-check need_addr after taking lock
Fixes: #1747 Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
Greg Farnum [Wed, 29 Feb 2012 01:30:23 +0000 (17:30 -0800)]
msgr: discard the local_pipe's queue on shutdown.
To facilitate this, we do two things:
1) actually identify the number of special code values we pass around
2) use that to prevent trying to put() those non-pointer values in
Pipe::discard_queue().
Then we just call local_pipe.discard_queue() in wait() like happens
(indirectly, via reaping) with all the normal Pipes in rank_pipe.
But this does make me think that we may be approaching the point
where it's appropriate to create a subclass LocalPipe (against a
RemotePipe like our current Pipe implementation is mostly intended
to be).
Should fix #2086.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Any time an osd goes down we want to ensure we remove it from peer_info.
Handling this in Reset and Started states captures all of the nested
states, which forward the event (or re-post transit to Reset). We can
also drop the Primary reaction, which is now superfluous.
Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Samuel Just <samuel.just@dreamhost.com> Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>