Sage Weil [Wed, 10 Aug 2011 23:07:33 +0000 (16:07 -0700)]
librados: implement/document tmap_{get,put}
These aren't strictly necessary now (you can just read the raw object or
do a writefull and get the same thing) but this way we document the format
and can change the backend to be smarter in the future without changing up
the interface.
Sage Weil [Wed, 10 Aug 2011 22:38:25 +0000 (15:38 -0700)]
mds: don't wait for lock 'flushing' flag on replicas
If we are a replica, the 'flushing' flag means that we had dirty scatterlock
data and are waiting for it to get flushed out to the auth copy (by
cycling from MIX->LOCK, normally). If we end up with 'flushing' set
while in the MIX state, we can't wait for it to clear before responding
to a lock request from the primary or we'll deadlock.
On the auth, flushing means flushing to the log, which makes sense; that
will always make progress despite scatterlock activity.
This fixes a hang from 3-mds fsstress with thrashing exports. (Strangely
I never hit this on fatty.)
Josh Durgin [Tue, 9 Aug 2011 23:14:23 +0000 (16:14 -0700)]
librbd: deduplicate sparse read interpretation
AioBlockCompletions and read_iterate each had their own copy of this
code, leading to bugs when only one was changed. Move this to a
separate function, handle_sparse_read.
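
A minimal sketch of what a shared sparse-read interpretation helper might
look like, written in Python for brevity. The function name, the
(offset, length) extent layout, and the zero-fill behaviour are assumptions
made for illustration; this is not the actual librbd handle_sparse_read
signature.

```python
def expand_sparse_read(extents, data, total_len):
    """Rebuild a full buffer from a sparse read result.

    extents   -- list of (offset, length) pairs that contain data
                 (hypothetical layout, ascending and non-overlapping)
    data      -- bytes for those extents, concatenated in order
    total_len -- length of the range the caller asked for
    Holes (ranges not covered by any extent) are returned as zeros.
    """
    out = bytearray(total_len)  # starts zero-filled, so holes need no work
    pos = 0                     # read cursor into the packed data
    for off, length in extents:
        if off < 0 or length < 0 or off + length > total_len:
            raise ValueError("extent outside requested range")
        if pos + length > len(data):
            raise ValueError("extent map describes more data than returned")
        out[off:off + length] = data[pos:pos + length]
        pos += length
    if pos != len(data):
        raise ValueError("returned data not fully described by extent map")
    return bytes(out)
```

Having both callers go through one helper like this is the point of the
change: a fix to the hole handling lands in a single place.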
Greg Farnum [Tue, 9 Aug 2011 22:43:11 +0000 (15:43 -0700)]
objecter: allow requesting specific maps in maybe_request_map
Use this capability so that wait_for_new_map can specify a specific
map it wants to get. I am extending the sins of the fathers by
allowing a default value of 0 for the epoch here, but removing it
is even clumsier and there are lots of uses where it legitimately
doesn't care what epoch it gets.
Greg Farnum [Mon, 8 Aug 2011 21:19:34 +0000 (14:19 -0700)]
pgmon: use pool.get_last_change whenever creating new PGs
We maintain last_change properly now, so we can use it at any time.
It may still be possible that we can get PGs with the wrong epoch
created if, somehow, we do multiple expansions without
check_osd_map getting called between them. That's a pretty unlikely
occurrence, though, and I'm not sure that it's actually possible.
Greg Farnum [Mon, 8 Aug 2011 20:48:29 +0000 (13:48 -0700)]
pgmon: call check_osd_map via a new on_active implementation
Previously it was possible to lose PG creations if a monitor election
happened at the right time. The issue would get rectified on the
next OSDMap update, but that could take...a while. (My observed time
when I discovered the bug had it go without creation for 43 minutes,
at which point I killed it.)
Sage Weil [Mon, 8 Aug 2011 19:16:41 +0000 (12:16 -0700)]
debian: explicitly bind library users to matching version
We are cheating with the shared libs by making small API changes without
bumping the soname. Bind users to a matching version to minimize user
pain. When the APIs become fully stable these will need to go away.
Fixes: #1354
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 5 Aug 2011 21:28:29 +0000 (14:28 -0700)]
mds: chain rename subtree projections
We can have two renames for the same file in flight to the journal. Stack
them up in a list. The old project_subtree_rename() should have asserted
that the item wasn't already in the map before inserting it to catch this
at the front end. Now it doesn't matter; it's a list.
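
The stacking idea can be sketched roughly as follows, in Python for brevity.
The class and method names are hypothetical, and the first-in, first-out pop
order is an assumption of this sketch rather than something the message
states.

```python
from collections import defaultdict, deque


class RenameProjections:
    """Per-inode queue of projected renames (illustrative only)."""

    def __init__(self):
        # inode -> deque of (old_parent, new_parent) pairs
        self._pending = defaultdict(deque)

    def project(self, inode, old_parent, new_parent):
        # A second rename for the same inode is simply stacked behind the
        # first one instead of overwriting it.
        self._pending[inode].append((old_parent, new_parent))

    def pop(self, inode):
        # Retire the oldest projection for this inode (assumed FIFO).
        queue = self._pending.get(inode)
        if not queue:
            raise KeyError(f"no projected rename for {inode!r}")
        entry = queue.popleft()
        if not queue:
            del self._pending[inode]
        return entry

    def pending(self, inode):
        return list(self._pending.get(inode, ()))
```

With a plain map keyed by inode, the second project() call would have
replaced the first entry; with a list per inode, both survive.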
Don't allow string-valued configuration items to be changed using
injectargs unless they have observers. Otherwise, we could have
crashes, since one thread could be reading the std::string's internal
buffer after another thread frees that buffer during assignment.
Write a unit test to validate this behavior.
Also test that we can turn on and off the log_file using injectargs.
This is something that injectargs often gets used for in practice.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
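
A rough sketch of the policy described above, in Python for brevity. The
Config class, its method names, and the option name some_path are
hypothetical. Python strings do not have the buffer-lifetime problem that
std::string has, so this only illustrates the guard, not the race.

```python
class Config:
    """Toy config store that guards runtime changes to string options."""

    def __init__(self, values):
        self._values = dict(values)
        self._observers = {}  # option name -> list of callbacks

    def add_observer(self, name, callback):
        self._observers.setdefault(name, []).append(callback)

    def get(self, name):
        return self._values[name]

    def inject(self, name, value):
        """Apply a runtime change; return False if it is refused."""
        current = self._values[name]
        observers = self._observers.get(name, [])
        # String options without observers are refused: nobody would be
        # told about the change, so readers could still hold the old value.
        if isinstance(current, str) and not observers:
            return False
        self._values[name] = value
        for callback in observers:
            callback(name, value)
        return True
```

For example, toggling log_file succeeds once something observes it, while an
unobserved string option is left untouched:

```python
cfg = Config({"log_file": "/tmp/a.log", "some_path": "/x", "debug_ms": 0})
cfg.add_observer("log_file", lambda name, value: None)
cfg.inject("log_file", "")      # accepted, observer is notified
cfg.inject("some_path", "/y")   # refused, value stays "/x"
cfg.inject("debug_ms", 5)       # accepted, not a string option
```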
Greg Farnum [Fri, 5 Aug 2011 18:07:03 +0000 (11:07 -0700)]
pgmon: create ALL pgs in new pools with correct creation epoch
5bb07df6aa4684ebd2e70437081dea170464d8ee tried to do this, but it
only set them properly for localized PGs. Whoops!
Additionally, we do NOT want to do this for new PGs in pre-existing
pools. Unfortunately, we have no way of guaranteeing that these new
PGs in old pools have the right epoch -- the data doesn't exist.
I'll discuss with other team members; it's possible that last_change
is in fact supposed to deal with this and simply doesn't.
Meanwhile, I've created a new bug to track this: #1365.
Greg Farnum [Thu, 4 Aug 2011 23:11:43 +0000 (16:11 -0700)]
pgmon: create PGs with a creation epoch that matches the pool's
Previously, the PGs were created with a creation epoch of the current
OSDMap. However, under some circumstances the PG creation can occur
under a later OSDMap epoch. This could lead to client requests that
were sent under epoch n, while the OSD insisted that the PG was
created in epoch n+1 so the client request was invalid. Since the
OSD expects the client to resubmit requests in such circumstances,
and the client thought all was well, this led to hanging client
requests.
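
The mismatch can be illustrated with a small sketch, in Python for brevity.
The function names and the simple "request epoch must not predate the PG's
creation epoch" check are assumptions made for illustration, not the actual
OSD code.

```python
def osd_accepts(request_epoch, pg_created_epoch):
    # Assumed check: a request sent under a map older than the PG's
    # creation epoch is treated as invalid, and the OSD expects a resend.
    return request_epoch >= pg_created_epoch


def pg_creation_epoch(pool_epoch, current_osdmap_epoch, use_pool_epoch):
    # Old behaviour: stamp the PG with whatever map is current.
    # New behaviour: stamp it with the epoch recorded for the pool.
    return pool_epoch if use_pool_epoch else current_osdmap_epoch


n = 10                                  # epoch in which the pool appeared
client_request_epoch = n                # client sends under epoch n
current_epoch_at_creation = n + 1       # PG creation happens one map later

old = pg_creation_epoch(n, current_epoch_at_creation, use_pool_epoch=False)
new = pg_creation_epoch(n, current_epoch_at_creation, use_pool_epoch=True)

old_ok = osd_accepts(client_request_epoch, old)  # False: request hangs
new_ok = osd_accepts(client_request_epoch, new)  # True: request proceeds
```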
Greg Farnum [Thu, 4 Aug 2011 21:34:57 +0000 (14:34 -0700)]
osd: Fix last_epoch_started initialization on new PGs
This used to be safe by virtue of assigning same_acting_since
to osdmap->get_epoch(), but since we fixed bugs by handling
that better we now need to update the last_epoch_started
initialization.
Greg Farnum [Thu, 4 Aug 2011 21:32:55 +0000 (14:32 -0700)]
osd: rename variables in project_pg_history.
These are always the current sets; I don't know why they would
ever be called lastup and it confused me when I started looking
at this. Confusion means something needs to change!
Sage Weil [Thu, 4 Aug 2011 20:48:55 +0000 (13:48 -0700)]
osd: expect heartbeats from anyone peering depends on
We were getting heartbeats from just acting replicas. That's really not
enough if we want to be sure to detect failures of OSDs we depend on,
which includes any stray or up OSDs as well.
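
As a rough illustration, in Python for brevity, the wider peer set is a union
of the three groups named above. The function name and the exclusion of the
local OSD id are assumptions of this sketch.

```python
def heartbeat_peers(whoami, acting, up, stray):
    """Return the set of OSD ids we should expect heartbeats from.

    Previously only the acting set was considered; peering also depends
    on up and stray OSDs, so they are included here as well.
    """
    peers = set(acting) | set(up) | set(stray)
    peers.discard(whoami)  # assumption: we never heartbeat ourselves
    return peers
```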
Greg Farnum [Thu, 4 Aug 2011 19:43:30 +0000 (12:43 -0700)]
osd: refactor PG creation slightly.
We want to carefully set up the PG History. In most cases this will
do the same thing as previously, since it's a brand new PG. However,
it's possible for the monitor to send out MOSDPGCreate messages for
the same PG to more than one OSD over many epochs if the OSDs are down
or are slow to respond. In that case we need to carefully track
history to make sure we don't lose old data.
Greg Farnum [Thu, 4 Aug 2011 16:15:25 +0000 (09:15 -0700)]
osd: Initialize new PGs with correct info.history.same_primary_since
Previously we were initializing based on the local osdmap epoch, which
is often correct, but if we process the MOSDPGCreate in an epoch after
the PG was created in the OSDMap, we could have problems with clients
sending out messages based on the creation epoch which the OSD would
reject as being earlier than same_primary_since. See bug #1357.