Greg Farnum [Wed, 2 Feb 2011 17:57:14 +0000 (09:57 -0800)]
osd: Fix compile-time warning.
store is properly initialized inside a try block, but the
compiler doesn't notice that and so thinks it may be used
uninitialized. So initialize it to be NULL.
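A minimal sketch of the pattern described above, not the actual cosd code. ToyStore, open_store, and use_store are hypothetical names used only for illustration. The pointer is assigned inside the try block, so it is given an explicit NULL initializer to keep the compiler from warning that it may be used uninitialized.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the object store; not the real Ceph class.
struct ToyStore {
  int id;
};

// Hypothetical factory that may throw on failure.
ToyStore *open_store(bool fail) {
  if (fail)
    throw std::runtime_error("open failed");
  return new ToyStore{42};
}

int use_store(bool fail) {
  // Explicit initialization: the compiler cannot always prove that the
  // assignment inside the try block happens before the later use.
  ToyStore *store = NULL;
  try {
    store = open_store(fail);
  } catch (const std::exception &) {
    return -1;  // store is never dereferenced on this path
  }
  int id = store->id;
  delete store;
  return id;
}
```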
Change how cosd handles unfound objects when doing operations with
localize_reads. Specifically, don't wait for unfound objects unless we
are the primary for the relevant PG.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Tue, 1 Feb 2011 21:48:39 +0000 (13:48 -0800)]
FileStore: fix double close
curr_fd is already closed if cp == cur_seq. This second close
occasionally ended up closing another thread's fd. The next open would
tend to grab that fd in op_fd or current_fd which would then get closed
by the other thread leaving op_fd or current_fd pointing to some random
file (or a closed descriptor).
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
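One common way to guard against the double close described above is to invalidate the descriptor as soon as it is closed. The helper below is an illustrative sketch under that assumption; close_once is a hypothetical name and is not taken from FileStore.

```cpp
#include <unistd.h>

// Close fd only if it is still valid, then mark it invalid so that a
// second call becomes a no-op instead of closing a descriptor number
// that another thread may have reused in the meantime.
// Returns true if close() was actually called.
bool close_once(int &fd) {
  if (fd < 0)
    return false;
  ::close(fd);
  fd = -1;
  return true;
}
```

Because the caller's variable is reset to -1, a later code path that closes "just in case" cannot hit an unrelated file.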
In FileStore::umount: check if FDs are valid before closing them. Make
them invalid after closing them. Shut down FileStore::timer.
In FileStore::mkfs: always properly shut down and free the filestore if
an error is encountered during mkfs. Check all functions that can fail.
Print out error messages on failures.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Sat, 29 Jan 2011 00:25:31 +0000 (16:25 -0800)]
mds: implement journal reset
This basically works. Remaining issues:
- mydir and root inodes are recreated from scratch but need to be
reconciled with what's committed (outside the old journal)
- ?
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 28 Jan 2011 20:35:38 +0000 (12:35 -0800)]
mds: defer sending resolves until mdsmap.failed.empty()
There is no point sending resolves while there are still failed nodes,
since we can't complete. We also trigger an assert if we try to send to
a failed node. Instead just wait until failed.empty() and then start.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
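The deferral described above can be sketched as a simple guard. This is a toy model, not the MDS code: ToyMds and maybe_send_resolves are hypothetical names, and the failed set only stands in for the set of failed ranks in the mdsmap.

```cpp
#include <set>

// Toy model of an MDS that must not send resolves while any rank is failed.
struct ToyMds {
  std::set<int> failed;        // stand-in for the failed ranks in the mdsmap
  bool resolves_sent = false;

  // Returns true once resolves have been sent; does nothing while
  // failed ranks remain, so it is safe to call on every map update.
  bool maybe_send_resolves() {
    if (!failed.empty())
      return false;            // cannot complete yet, so wait
    resolves_sent = true;
    return true;
  }
};
```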
Sage Weil [Fri, 28 Jan 2011 09:24:49 +0000 (01:24 -0800)]
osd: fix mutual exclusion for _dispatch
We want only one thread dispatching messages (either new or requeued), so
that we can preserve ordering. Previously we weren't doing so for all
callers of do_waiters (tick() and the first in ms_dispatch()).
This fixes osd_sub_op(_reply) ordering problems that trigger the
now-famous repop queue assert.
Sage Weil [Thu, 27 Jan 2011 16:47:48 +0000 (08:47 -0800)]
mds: cluster_fail instead of reset_cluster
Mark all cluster members as failed, and blacklist. Do not force up/failed
ranks to stopped, as that requires the admin to do other trickery. This
keeps the cluster fail orthogonal to any journal discard/reset.
Sage Weil [Wed, 26 Jan 2011 18:06:49 +0000 (10:06 -0800)]
osd: preserve ordering when ops are requeued
Requeue ops under osd_lock to preserve ordering wrt incoming messages.
Also drain the waiter queue when ms_dispatch takes the lock before calling
_dispatch(m).
Fixes: #743
Signed-off-by: Sage Weil <sage@newdream.net>
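A rough sketch of the ordering idea above, assuming a single lock that covers both requeueing and dispatch. ToyDispatcher and its members are hypothetical and much simpler than the real OSD; the point is only that requeued ops are drained, under the same lock, before the newly arrived message is handled.

```cpp
#include <deque>
#include <mutex>
#include <vector>

struct ToyDispatcher {
  std::mutex lock;             // stand-in for osd_lock
  std::deque<int> waiters;     // requeued ops, oldest first
  std::vector<int> handled;    // order in which ops were dispatched

  // Requeue under the lock so the op cannot interleave with a dispatch.
  void requeue(int op) {
    std::lock_guard<std::mutex> l(lock);
    waiters.push_back(op);
  }

  // Drain older requeued ops first, then handle the new message.
  void ms_dispatch(int m) {
    std::lock_guard<std::mutex> l(lock);
    while (!waiters.empty()) {
      handled.push_back(waiters.front());
      waiters.pop_front();
    }
    handled.push_back(m);
  }
};
```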
Sage Weil [Tue, 25 Jan 2011 23:28:49 +0000 (15:28 -0800)]
osd: restart if the osdmap client, heartbeat, OR cluster addrs don't match
If we somehow get ourselves into a situation where the OSDMap addresses do
not match our actual addresses, restart and try again. This is still
possible if multiple MOSDBoot messages end up in flight in the monitor,
say due to a monitor disconnect/reconnect, and we race with something that
marks us down in the map.
Sage Weil [Tue, 25 Jan 2011 23:04:06 +0000 (15:04 -0800)]
osd: avoid extraneous send_boot() calls
Only send_boot() on osdmap update if we are restarting. Otherwise we can
end up with too many MOSDBoot messages in flight and the monitor may
apply an old one instead of a new one. For example:
- cosd starts
- send_boot with address set A
- get an osdmap update
- send_boot again with address set A
- get an osdmap update. now we're up.
- get an osdmap update. now we're marked down.
- bind to address set B
- send_boot with address set B
and the monitor may apply the second MOSDBoot (with address set A).
This results in an online OSD using a cluster address that differs from
that in the OSDMap, which causes problems with peering, among other
things.
Greg Farnum [Wed, 26 Jan 2011 22:05:35 +0000 (14:05 -0800)]
librados: Remove rados_pool_t& usage, and pointless consts.
For some reason when I wrote this I passed rados_pool_t by reference
in some functions instead of by value. It's just a void*, so this is
silly.
Also silly, some of the passed-by-value rados_pool_ts were declared
to be const. WTF?
Samuel Just [Tue, 25 Jan 2011 21:58:36 +0000 (13:58 -0800)]
ReplicatedPG: _rollback_to fix the just cloned condition
_rollback_to in the case that head was just cloned and that clone
includes snapid does not need to do anything. Previously, snapid would
have to match the snap on the clone, but the condition should be that
snapid is contained within the clone's snaps set.
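The change in condition can be illustrated as follows. This is a sketch with hypothetical types (ToyClone, old_check, new_check), not the ReplicatedPG code: the old test compared snapid against a single snap on the clone, while the new test asks whether snapid is a member of the clone's snaps set.

```cpp
#include <cstdint>
#include <set>

typedef uint64_t snapid_t;     // hypothetical alias for illustration

struct ToyClone {
  snapid_t snap;               // the snap the clone is named after
  std::set<snapid_t> snaps;    // all snaps the clone covers
};

// Old condition: only an exact match counts.
bool old_check(const ToyClone &c, snapid_t snapid) {
  return c.snap == snapid;
}

// New condition: any snap contained in the clone's snaps set counts.
bool new_check(const ToyClone &c, snapid_t snapid) {
  return c.snaps.count(snapid) > 0;
}
```

For a clone covering several snaps, the old check misses every snap except the one the clone is named after.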
Greg Farnum [Tue, 25 Jan 2011 22:07:02 +0000 (14:07 -0800)]
MDSMonitor: fix bugs with standby-replay assignment.
We were accidentally passing gid instead of rank into find_standby_for!
Also, if we got an MDS with rank -1 we went ahead and used it. Broke
up the if statement tests to make sure that doesn't happen again.
Greg Farnum [Mon, 24 Jan 2011 18:56:39 +0000 (10:56 -0800)]
MDSMonitor: Don't create new map for standby-replay spam.
If an MDS is unable to get into the standby-replay state for some
reason (MDS it should be following doesn't exist yet, there aren't
any open MDSes, etc) it will spam the Monitor with beacons asking
to change state. These will always go to prepare_beacon since
they're asking for a state change, but can't be granted.
When this happens, return false, not true!
Greg Farnum [Mon, 24 Jan 2011 18:30:07 +0000 (10:30 -0800)]
MDSMonitor: be more conservative with use of pending_mdsmap.
Use the current mdsmap when looking for MDSes to standby-replay for,
as that way we know the other MDS is already up. Otherwise we could
try and come up together and potentially race.
Greg Farnum [Mon, 24 Jan 2011 18:28:22 +0000 (10:28 -0800)]
MDS: MDSMonitor: Make MDS set standby-replay preferences, not MDSMonitor.
The MDS has more information about its configuration than the MDSMonitor
does. Therefore, encode that information into the standby_for_rank,
and let the monitor just operate based off that. This reduces magic
numbers and should be more robust.
Greg Farnum [Thu, 20 Jan 2011 21:37:00 +0000 (13:37 -0800)]
MDSMonitor: Try to assign unassigned standby-replay MDSes during tick()
We can now specify an MDS as standby-replay and let the monitor
assign it to any MDS. The monitor will only assign it to an
MDS that doesn't already have a hot standby, though.
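The assignment rule above can be sketched as a small selection function. This is a toy model and not the MDSMonitor implementation: pick_rank_to_follow is a hypothetical name, and the map from rank to "already has a hot standby" is an assumed representation.

```cpp
#include <map>

// Given rank -> "already has a hot standby", return the lowest rank that
// still lacks one, or -1 if every rank is already covered.
int pick_rank_to_follow(const std::map<int, bool> &has_hot_standby) {
  for (const auto &entry : has_hot_standby) {
    if (!entry.second)
      return entry.first;
  }
  return -1;  // nothing to assign; leave the standby unassigned for now
}
```

Calling this from a periodic tick would leave an unassigned standby-replay daemon alone until some rank lacks a hot standby.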