Sage Weil [Tue, 10 Jul 2012 01:16:44 +0000 (18:16 -0700)]
mkcephfs: error out if mon data directory is not empty
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.
So, ensure that the directory is empty at mkfs time. This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
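
The check itself is small. mkcephfs is a shell script, so the following C++ is only an illustrative sketch of the same idea, not the code from this commit; DirState and check_mon_data_dir are hypothetical names.

```cpp
#include <cassert>
#include <cstring>
#include <dirent.h>
#include <string>

// Hypothetical result type for the sketch.
enum class DirState { Empty, NotEmpty, Missing };

// Report whether a directory exists and contains anything besides "." and "..".
DirState check_mon_data_dir(const std::string& path) {
  DIR* d = opendir(path.c_str());
  if (!d)
    return DirState::Missing;
  DirState state = DirState::Empty;
  while (struct dirent* e = readdir(d)) {
    if (std::strcmp(e->d_name, ".") != 0 && std::strcmp(e->d_name, "..") != 0) {
      state = DirState::NotEmpty;  // found a real entry: refuse to mkfs here
      break;
    }
  }
  closedir(d);
  return state;
}
```

A caller would error out on NotEmpty instead of removing the contents, which is the safer choice described above.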
Samuel Just [Mon, 9 Jul 2012 22:53:31 +0000 (15:53 -0700)]
ReplicatedPG: fix replay op ordering
After a client reconnect, the client replays outstanding ops. The
OSD then immediately responds with success if the op has already
committed (version < ReplicatedPG::get_first_in_progress).
Otherwise, we stick it in waiting_for_ondisk to be replied to when
eval_repop concludes that waitfor_disk is empty.
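
A simplified model of the replay handling described above, as a sketch only. ReplayTracker, handle_replay and on_repop_committed are hypothetical names, versions are plain integers rather than the OSD's version type, and "replying" is modeled by recording the op id.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct ReplayedOp {
  uint64_t version;  // version the op was assigned before the reconnect
  int id;            // hypothetical op identifier
};

class ReplayTracker {
 public:
  explicit ReplayTracker(uint64_t first_in_progress)
      : first_in_progress_(first_in_progress) {}

  // Returns true if the replayed op was answered immediately.
  bool handle_replay(const ReplayedOp& op) {
    if (op.version < first_in_progress_) {
      replied_.push_back(op.id);  // already committed: reply with success now
      return true;
    }
    waiting_for_ondisk_[op.version].push_back(op.id);  // reply later
    return false;
  }

  // Models the point where the repop for `version` has nothing left to wait
  // for on disk: answer every replayed op parked on that version.
  void on_repop_committed(uint64_t version) {
    auto it = waiting_for_ondisk_.find(version);
    if (it == waiting_for_ondisk_.end())
      return;
    for (int id : it->second)
      replied_.push_back(id);
    waiting_for_ondisk_.erase(it);
  }

  const std::vector<int>& replied() const { return replied_; }

 private:
  uint64_t first_in_progress_;
  std::map<uint64_t, std::vector<int>> waiting_for_ondisk_;
  std::vector<int> replied_;
};
```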
librbd: return an error when removing a non-existent image
Try treating the image as new format if it's not in the old-style
directory, which is the last step in old-style removal. Then if the
image is not found in the new-style directory, -ENOENT will be
returned, preserving the semantics that existed prior to 6f096b6cdc66bb92762aa92e51e5e448039cf3e3.
Sage Weil [Fri, 6 Jul 2012 01:08:58 +0000 (18:08 -0700)]
librados: take lock when signaling notify cond
When we are signaling the cond to indicate that a notify is complete,
take the appropriate lock. This removes the possibility of a race
that loses our signal. (That would be very difficult given that there
are network round trips involved, but this makes the lock/cond usage
"correct.")
Sage Weil [Fri, 6 Jul 2012 02:12:22 +0000 (19:12 -0700)]
cond: assert that we are holding the same mutex as the waiter
Try to verify that we are holding the same mutex that the waiter is
waiting on. Specifically:
* only wait on a single mutex for this cond
* remember which mutex that is
* if we signal and someone has waited, try to make sure we are holding
the mutex as well. (Mutex::is_locked() is insufficient here; it doesn't
ensure that *our* thread took the mutex. It is necessary, though!)
Introduce a sloppy_signal() method that can be used if we actually mean
to signal the cond without holding the proper lock (and, presumably,
don't care about losing a signal).
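
A minimal sketch of the idea using standard library primitives in place of Ceph's Mutex and Cond classes. OwnedMutex, CheckedCond and holds_waiter_mutex are hypothetical names. Unlike Mutex::is_locked(), this wrapper records the owning thread, so the check here is stricter than what the commit describes.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that remembers which thread holds it.
class OwnedMutex {
 public:
  void lock() {
    m_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void unlock() {
    owner_.store(std::thread::id());  // default id means "no owner"
    m_.unlock();
  }
  bool is_locked_by_me() const {
    return owner_.load() == std::this_thread::get_id();
  }

 private:
  std::mutex m_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};

class CheckedCond {
 public:
  // Wait until pred() is true. Only one mutex may ever be used with this cond.
  template <typename Pred>
  void wait(OwnedMutex& m, Pred pred) {
    assert(m.is_locked_by_me());
    assert(waiter_mutex_ == nullptr || waiter_mutex_ == &m);
    waiter_mutex_ = &m;  // remember which mutex the waiters use
    cv_.wait(m, pred);
  }

  // True if nobody has waited yet, or if the caller holds the waiters' mutex.
  bool holds_waiter_mutex() const {
    return waiter_mutex_ == nullptr || waiter_mutex_->is_locked_by_me();
  }

  // Normal signal: the caller is expected to hold the waiters' mutex.
  void signal() {
    assert(holds_waiter_mutex());
    cv_.notify_all();
  }

  // Explicit opt-out for callers that accept a possibly lost signal.
  void sloppy_signal() { cv_.notify_all(); }

 private:
  std::condition_variable_any cv_;
  OwnedMutex* waiter_mutex_ = nullptr;
};
```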
Sage Weil [Fri, 6 Jul 2012 04:26:27 +0000 (21:26 -0700)]
osd: fix PG dtor compile error
We need at least one non-pure virtual method to tell gcc where the
vtable goes. The destructor wins!
libosd.a(libosd_a-ReplicatedPG.o): In function `~PG':
/home/sage/src/ceph/src/osd/PG.h:1367: undefined reference to `vtable for PG'
libosd.a(libosd_a-ReplicatedPG.o):(.rodata._ZTI12ReplicatedPG[typeinfo for ReplicatedPG]+0x10): undefined reference to `typeinfo for PG'
libosd.a(libosd_a-PG.o): In function `PG':
/home/sage/src/ceph/src/osd/PG.cc:85: undefined reference to `vtable for PG'
...
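
For context: under the C++ ABI that gcc uses, the vtable is emitted in the translation unit that defines the class's first non-pure, non-inline virtual function. The sketch below shows the usual pattern of giving the destructor that role. PGBase and ReplicatedPGLike are hypothetical stand-ins, not the classes touched by this commit.

```cpp
#include <cassert>
#include <string>

class PGBase {
 public:
  virtual ~PGBase();                     // declared here, defined out of line
  virtual std::string name() const = 0;  // pure virtual: cannot anchor the vtable
};

// The out-of-line definition lives in exactly one .cc file; that object file
// is where the compiler emits the vtable and typeinfo for PGBase.
PGBase::~PGBase() {}

class ReplicatedPGLike : public PGBase {
 public:
  std::string name() const override { return "replicated"; }
};

std::string describe(const PGBase& pg) { return pg.name(); }
```

If the destructor were declared but never defined, linking would fail with "undefined reference to `vtable for ...'" errors like the ones quoted above.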
Samuel Just [Thu, 5 Jul 2012 22:39:24 +0000 (15:39 -0700)]
PG,ReplicatedPG: on_removal must handle repop and watcher state
on_removal is now in ReplicatedPG in order to handle watcher state
and repop state. Additionally, workqueue dequeues are handled already
in OSD::_remove_pg.
Samuel Just [Tue, 3 Jul 2012 15:53:54 +0000 (08:53 -0700)]
OSD: clean up recovery_wq queueing and ref counting
Previously, we tended to explicitly remove the pg from the queue using
remove_myself on the xlist::item. This causes us to drop a reference
count. Manipulating the recovery_wq is now accomplished through the
recovery_wq interface, which also handles pg ref counting.
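
A sketch of the pattern, assuming a queue interface that pairs every enqueue with a reference grab and every dequeue with a release. PGStub and RecoveryQueue are hypothetical; the real recovery_wq and PG ref counting are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Hypothetical stand-in for a ref-counted PG.
struct PGStub {
  int refs = 0;
  bool queued = false;
  void get() { ++refs; }
  void put() { --refs; }
};

class RecoveryQueue {
 public:
  // Enqueue once; the queue holds one reference while the pg is queued.
  void queue(PGStub* pg) {
    if (pg->queued)
      return;
    pg->get();
    pg->queued = true;
    q_.push_back(pg);
  }

  // Dequeue and release the queue's reference in the same place.
  void dequeue(PGStub* pg) {
    if (!pg->queued)
      return;
    q_.erase(std::remove(q_.begin(), q_.end(), pg), q_.end());
    pg->queued = false;
    pg->put();
  }

 private:
  std::deque<PGStub*> q_;
};
```

Because callers never touch the underlying list directly, the reference count cannot drift out of step with queue membership.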
Samuel Just [Mon, 18 Jun 2012 19:52:56 +0000 (12:52 -0700)]
PG: pass activate epoch with Activate event
This allows us to pass into activate() the epoch in which the
message triggering activation occurred, allowing us to mark
the activate committed callback with the right query_epoch.
Samuel Just [Thu, 7 Jun 2012 04:27:38 +0000 (21:27 -0700)]
OSD: on pg_removal, project_pg_history to get current interval
First, we don't really want to remove the pg if we can use it. Second,
there might be messages in the pg peering queue for the next interval.
If one of those happens to be an info request or notify, we would lose
the peering message.
If the message falls in the current interval as determined by the
current osdmap, then we know that any messages currently queued must be
obsolete and can safely be discarded.
Samuel Just [Sun, 17 Jun 2012 23:16:42 +0000 (16:16 -0700)]
OSD,PG: clean up _get_or_create_pg and set interval based on msg
Previously, we set last_peering_reset based on the epoch in which the pg
is created. We now pass the map from the query_epoch to the creation
methods and set last_peering_reset based on that map instead.
Samuel Just [Mon, 18 Jun 2012 19:52:06 +0000 (12:52 -0700)]
PG,OSD: prevent pg from completing peering until deletion is complete
hobject_t must now be globally unique in the filestore. Thus, if we
start creating objects in a pg before the removal collections for the
previous incarnation are fully removed, we might end up with a second
instance of the same hobject, violating the filestore rules.
Samuel Just [Fri, 29 Jun 2012 21:11:07 +0000 (14:11 -0700)]
OSD,PG: clean up pg removal
PG opsequencers will be used for removing a pg. If the pg is recreated
before the removal is complete, we need the new pg incarnation to be
able to inherit the osr of its predecessor.
Previously, we queued the pg for removal and only rendered it unusable
after the contents were fully removed. Now, we synchronously remove it
from the map and queue a transaction renaming the collections. We then
asynchronously clean up those collections. If the pg is recreated, it
will inherit the same osr until the cleanup is complete, ensuring correct
op ordering with respect to the collection rename.
Samuel Just [Wed, 20 Jun 2012 22:42:18 +0000 (15:42 -0700)]
PG: flush ops by the end of peering without osr.flush
Rather than explicitly flushing the filestore, send a noop through the
filestore at the beginning of peering and, at the end, wait for it to
finish by adding an extra state.
Also, delay ops until flushed is true. Until we have finished flushing,
we cannot safely read objects.
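
A single-threaded model of the noop-based flush, as a sketch under simplifying assumptions. OrderedStore stands in for an ordered filestore queue and PeeringPG for the pg; both are hypothetical, and ops are plain integers.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical ordered store: completions fire strictly in queue order.
class OrderedStore {
 public:
  void queue(std::function<void()> on_applied) {
    pending_.push_back(std::move(on_applied));
  }
  void apply_one() {
    if (pending_.empty())
      return;
    std::function<void()> cb = std::move(pending_.front());
    pending_.pop_front();
    cb();
  }

 private:
  std::deque<std::function<void()>> pending_;
};

class PeeringPG {
 public:
  // Start of peering: send a noop whose completion marks the pg as flushed.
  void start_peering(OrderedStore& store) {
    flushed_ = false;
    store.queue([this] { on_flushed(); });
  }

  // Ops are delayed until everything queued before the noop has been applied.
  void handle_op(int id) {
    if (flushed_)
      executed_.push_back(id);
    else
      delayed_.push_back(id);
  }

  bool flushed() const { return flushed_; }
  const std::vector<int>& executed() const { return executed_; }

 private:
  void on_flushed() {
    flushed_ = true;
    for (int id : delayed_)
      executed_.push_back(id);  // release delayed ops in arrival order
    delayed_.clear();
  }

  bool flushed_ = false;
  std::vector<int> delayed_;
  std::vector<int> executed_;
};
```

Since the store applies entries in order, the noop's completion implies that every earlier write has been applied, without a blocking flush call.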
Samuel Just [Thu, 14 Jun 2012 02:05:47 +0000 (19:05 -0700)]
OSD,PG: Move pg accessible methods, objects to OSDService
In order to clarify data structure locking, PGs will now access
OSDService rather than the OSD directly. Over time, more structures will
be moved to the OSDService. osd_lock can no longer be held while pg
locks are held.
Samuel Just [Mon, 7 May 2012 20:51:55 +0000 (12:51 -0800)]
OSD: do not drop osd_lock in handle_osd_map
PGs have their map updates done in a different thread. Thus, we no
longer need to grab the pg locks. activate_map no longer requires
the map_lock in order to allow us to queue events for the pgs.
Samuel Just [Thu, 31 May 2012 05:19:58 +0000 (21:19 -0800)]
OSD: initialize pgs in get_or_create_pg via handle_create
Previously, pgs were initialized via Info/Log/etc. Since the event
which triggered the pg creation may now be queued, map update events may
occur before the event is processed. Thus, get_or_create_pg now handles
the initialization prior to queuing the event.
Samuel Just [Tue, 24 Apr 2012 23:00:49 +0000 (16:00 -0700)]
OSD,PG: handle pg map advance in process_peering_event
The pg map will now be advanced in process_peering_event (in advance_pg)
to allow handle_osd_map to not grab pg locks in-line. handle_osd_map
queues NullEvts to ensure that each pg is updated in a timely fashion.