Sage Weil [Wed, 17 Nov 2010 22:37:38 +0000 (14:37 -0800)]
osd: rev PG::Info encoding for last_epoch_clean change
This was missed by 184fbf582b27c10b47101735a4495fe8c73ad186, so any fs
created between now and then won't decode properly. It's more important
to make an fs prior to that work, though, so that the upgrade path from
the last stable version works.
Sage Weil [Wed, 17 Nov 2010 19:39:24 +0000 (11:39 -0800)]
mds: wrlock scatterlocks to prevent a gather racing with split/merge logging
We have the dirs split in our cache for some time while journaling it to
disk, before the fragment_notify goes out. Make sure we don't do a
scatterlock gather during that time that will confuse the inode auth (who
has their dirfrags fragmented differently).
Track discover requests by tid. The old system of tracking outstanding
discovers was kludgey and somewhat broken. Also there is a possibility
of getting dup replies if someone does kick_requests().
There is still room for improvement with the logic detemrining when a
discover is sent: we may want to discover multiple dirfrags in parallel,
but the current code will only do one at a time.
Signed-off-by: Sage Weil <sage@newdream.net>
comment
Jim Schutt [Wed, 17 Nov 2010 20:39:52 +0000 (13:39 -0700)]
Detect broken system linux/fiemap.h
RedHat 5.5 has a /usr/include/linux/fiemap.h, but it is
broken because it does not itself include linux/types.h.
As a result, __u64 and friends are not defined.
We have a Ceph-local copy of fiemap.h, so use it
if the system version is broken.
While we're at it, fix up the configure message to
note we're using a local copy.
Signed-off-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Sage Weil <sage@newdream.net>
There are two phases in recovery: one where we get all the right objects
on to the primary, and another where we push all those objects out to
the replicas. Formerly, we would not start the second phase until there
were no missing objects at all on the primary.
This change modifies that so that we will start the second phase even if
there are unfound objects. However, we will still wait for all findable
missing objects to be brought to us, of course.
Get rid of uptodate_set. We can find the same information by looking at
the missing and missing_loc sets directly. Keeping the uptodate_set...
er... up-to-date would be very difficult in the presence of all the things
that can modify the missing and missing_loc sets.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Add discover_all_missing. This function makes sure that we have messages
en route to any OSD that we think might have information that could help
us discover where our unfound objects lie.
We call this function:
* In do_peer, right after activating the PG
* In _process_pg_info, when we're the primary of an active PG
* From handle_pg_notify, when we're the primary of an active PG
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
PG::search_for_missing: fix a bug with the handling of MSG_OSD_PG_INFO
messages. Formerly, when processing these messages, we erroneously
assumed that there was nothing missing on the peer at all even in cases
where there were missing objects.
PG::merge_log: drop unused Missing parameter.
_process_pg_info: Don't assume that just because we requested a Log
message at some point, that that is the message we're prcessing.
Correctly handle cases where we didn't get the peer's Missing set or
Log.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Add MOSDPGMissing, a message which just contains the missing objects
information for a PG. We will request messages like this one in order to
locate all of our unfound objects.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Mon, 15 Nov 2010 04:26:52 +0000 (20:26 -0800)]
msgr: use provided rx buffer if present
This changes the read path so that we hold the Connection::lock mutex while
reading data off the socket. This ensures that we are reading into a
buffer we are allowed to use, and allows users to revoke a previously
posted buffer. If that happens, switch over to a newly allocated buffer.
Note that currently the final result bufferlist may contain part of the
provided buffer and part of a newly allocated buffer. This is okay as long
as we will always read the same data into the buffer. And in practice, if
the rx buffer is revoked then the message itself will be thrown out anyway.
We have to explictly shut down the timer in Objecter::shutdown.
Otherwise, we are relying on the destructor of SafeTimer to do it.
Unfortunately, that destructor gets called after the mutex the timer is
using has already been destroyed.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Fri, 12 Nov 2010 23:56:54 +0000 (15:56 -0800)]
msgr: do not clear halt_delivery
We need to keep the halt_delivery plug set on failure/shutdown in order to
prevent a racing reader from queuing new messages. The only time we clear
it is when we discard messages due to a session reset.
Sage Weil [Fri, 12 Nov 2010 21:09:24 +0000 (13:09 -0800)]
msgr: only close socket on reconnect or shutdown
We can't modify 'sd' or (more importnatly) close sd while any other thread
might be using it, or else we might race with an open and they might end
up using someone else's fd.
Take care to _only_ close(sd) in connect(), when the reader thread is
stopped, or when reaping the connection.
Sage Weil [Fri, 12 Nov 2010 21:41:14 +0000 (13:41 -0800)]
msgr: protect pipe queuing with _both_ pipe and dispatch_queue locks
We want to make sure the pipe's queue item doesn't go away.
Also, make queue_received() require pipe_lock to be held. This avoids some
useless unlocking/locking, since (in the case where the pipe is already
queued) we then don't need to drop the pipe_lock at all.
Sage Weil [Fri, 12 Nov 2010 15:55:41 +0000 (07:55 -0800)]
uclient: insert lssnap results under snapdir, not live dir
Put the readdir results (list of snapshots) in the right place in the
hierarchy; we were putting them in the parent dir (as if they were real
directories).
This bug manifested itself as a snaptest-2.sh failure.
Sage Weil [Fri, 12 Nov 2010 00:38:02 +0000 (16:38 -0800)]
timer: rewrite mostly from scratch
Just use the provided lock. This _vastly_ reduces the complexity because
we don't have to worry about races between our thread trying to fire off a
timer that is being canceled.
Sage Weil [Thu, 11 Nov 2010 04:58:49 +0000 (20:58 -0800)]
mds: fix null_snapflush with multiple intervening snaps
The client is allowed to not send a snapflush if there is no dirty metadata
to write for a given snap. However, the mds can only look up inodes by
the last snapid in the interval. So, when doing a null_snapflush (filling
in for snapflushes the client didn't send), we have to walk forward through
intervening snaps until we find the right inode.
Note that this means we will call _do_snap_update multiple times on the
same inode, but with different snapids.
Sage Weil [Wed, 10 Nov 2010 23:33:31 +0000 (15:33 -0800)]
osd: call sched_scrub on reserve reply
Otherwise we have to wait until the next time it's called by the timer, and
during that period we have a reservation locally, and any other peers can't
reserve a scrub from us, and nobody makes any progress.
PG::search_for_missing: when we find a previously unfound object, check
to see if there is an entry in waiting_for_missing_object representing a
client waiting for this object.
PG::repair_object: assert that waiting_for_missing_object is empty
before messing with missing_loc. It definitely should be during a scrub.
ReplicatedPG role change logic: always take_object_waiters on the wait
queues when the PG acting set changes.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>