Sage Weil [Fri, 18 May 2012 01:13:57 +0000 (18:13 -0700)]
mon: send join message if we are in monmap with blank addr
Being in the monmap with a blank ip is possible if we were an initial member
seed but weren't part of the first election/quorum. If that's the case,
first update ourselves before we try to call an election and join the
quorum.
Sage Weil [Fri, 18 May 2012 01:17:18 +0000 (18:17 -0700)]
mon: simplify/clean up dummy addrs used for initial members
Use a completely blank IP, but set the nonce. This way
entity_addr_t::is_blank_ip() works. We're also outside of the namespace of
possible addrs that the mons *could* use, which is a bit cleaner.
Sage Weil [Fri, 18 May 2012 00:51:47 +0000 (17:51 -0700)]
mon: set our addr when populating monmap with initial members
If the seed monmap doesn't contain us and we are populating it with
initial members, and one of those members is us, use our addr instead of
using a dummy one.
Sage Weil [Fri, 18 May 2012 00:50:49 +0000 (17:50 -0700)]
mon: add peers probing us to extra peer list
If we are probed by another monitor, add them to our extra probe list. This
lets us rely on the active probe/reply to gather information and not infer
anything from here.
Sage Weil [Thu, 17 May 2012 21:50:31 +0000 (14:50 -0700)]
qa: add a bunch of mon bootstrap tests
These aren't comprehensive because a lot of the startup logic is about
picking a local address, and it's difficult to test that on a single
host. They cover the other variables surrounding mon bringup, though:
- part of initial monmap, or not
- new nodes given all prior nodes, or not
- new nodes have self included in monmap seed, or not
- initial quorum members
Sage Weil [Thu, 17 May 2012 21:46:46 +0000 (14:46 -0700)]
mon: ignore election messages from outside monmap
These shouldn't(tm) happen with new code, but with old code they do. And
if we get them, elector can try to monmap->get_inst() on them and crash.
Throw them out here; they're nonsense from our perspective anyway if the
peer isn't part of our monmap.
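A minimal sketch of the filtering rule described above, written in Python for illustration rather than taken from the monitor code. The function name and the use of plain name strings are assumptions made for this example.

```python
def should_handle_election_message(sender_name, monmap_names):
    """Return True only if the sender is a member of our monmap."""
    return sender_name in monmap_names

# Hypothetical monmap with three members; "x" is an outsider.
monmap_names = {"a", "b", "c"}
incoming = ["a", "x", "c"]

# Messages from peers outside the monmap are dropped before the
# elector ever tries to look them up.
accepted = [name for name in incoming
            if should_handle_election_message(name, monmap_names)]
```

Dropping the message early means the elector never looks up a peer that the monmap cannot resolve.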
Sage Weil [Thu, 17 May 2012 20:27:30 +0000 (13:27 -0700)]
mon: limit initial quorum to mon_initial_members
This is a two-stage process.
* If we start up, and have never joined a quorum, and initial members are
specified, only include them in the monmap; put all others in the
extra probe list.
* We add missing members if necessary to make the monmap (and initial
quorum) the right size.
* If we probe someone that *has* participated in a successful quorum, we
get their monmap, and restart, so this is moot.
* We only call an election to create a new cluster if outside_quorum gets
big enough (it will only include initial members) and if it includes us.
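The steps above can be sketched roughly as follows. This is an illustration in Python, not the monitor's implementation: the function names are hypothetical, `None` stands in for a blank dummy address, and "big enough" is assumed here to mean a simple majority of the trimmed monmap.

```python
def apply_initial_members(seed_monmap, initial_members):
    """Split a seed monmap (name -> addr) into the trimmed monmap
    and the set of extra addresses to probe."""
    monmap = {n: a for n, a in seed_monmap.items() if n in initial_members}
    extra_probe_peers = {a for n, a in seed_monmap.items()
                         if n not in initial_members}
    # Add missing initial members so the monmap has the right size.
    for name in initial_members:
        monmap.setdefault(name, None)  # None: placeholder for a blank addr
    return monmap, extra_probe_peers

def should_call_election(outside_quorum, monmap, me):
    """Create a new cluster only if enough initial members are up,
    and we are one of them."""
    needed = len(monmap) // 2 + 1  # assumption: simple majority
    return me in outside_quorum and len(outside_quorum) >= needed

seed = {"a": "10.0.0.1:6789", "d": "10.0.0.4:6789"}
monmap, extra = apply_initial_members(seed, ["a", "b", "c"])
```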
Sage Weil [Thu, 17 May 2012 19:36:41 +0000 (12:36 -0700)]
mon: use current monmap for initial quorum
This makes a bit more sense. Don't use the seed monmap, but use the one
we ended up with when we formed our first quorum. This will do a better
job of picking up names of peers, and also ensure we get a map based on
the mon initial members (if specified).
Sage Weil [Thu, 17 May 2012 18:08:38 +0000 (11:08 -0700)]
mon: take probed peer's monmap if it has ever joined a quorum
If we probe a peer and their monmap has actually been part of a started
cluster/quorum, and ours hasn't, take theirs. Comparing versions isn't
sufficient.
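A small sketch of why an epoch comparison alone can pick the wrong map. The dictionaries and field names (`epoch`, `joined`) are assumptions for this example and do not mirror the real MonMap structure.

```python
def newer_by_epoch(ours, theirs):
    """The version-only comparison, which is not sufficient on its own."""
    return theirs["epoch"] > ours["epoch"]

def should_take_peer_monmap(ours, theirs):
    """Take the peer's map if it has been part of a real quorum and ours has not."""
    return theirs["joined"] and not ours["joined"]

# Our seed map carries a higher epoch but has never been part of a quorum.
ours = {"epoch": 3, "joined": False}
theirs = {"epoch": 1, "joined": True}
```

With these values the epoch check would keep our never-committed seed map, while the joined check takes the peer's map.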
Sage Weil [Thu, 17 May 2012 18:09:24 +0000 (11:09 -0700)]
monmaptool: don't increment epoch on modification
This just confuses things, because a manually manipulated map might have
some epoch number that bears no relation to the actual published/committed
maps.
Sage Weil [Thu, 17 May 2012 18:00:46 +0000 (11:00 -0700)]
mon: clean up "joined" flag
- check flag on init, keep in memory
- set flag only when we are active and have a committed monmap. i.e., when
we are active participants in a bootstrapped/created mon cluster.
When a given log level L was specified, we would reply with all the
messages of "level L and below"; for instance, for a 'log-error' we would
present all the messages of level 'error', 'warn', 'sec', 'info' and
'debug'.
We shouldn't be doing it that way, so we just inverted the filter
condition. Now we show only 'L and above'; i.e., for a log level of
'log-warn', show only 'log-warn' and 'log-error'.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
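The inverted filter can be illustrated like this. It is a Python sketch, not the log monitor's code; the ordering of levels follows the one given in the message above, and the function names are made up for the example.

```python
# Lowest to highest priority, per the ordering in the commit message.
LEVELS = ["debug", "info", "sec", "warn", "error"]

def old_visible_levels(subscribed):
    """Previous behaviour: level L and everything below it."""
    return LEVELS[:LEVELS.index(subscribed) + 1]

def visible_levels(subscribed):
    """New behaviour: level L and everything above it."""
    return LEVELS[LEVELS.index(subscribed):]
```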
src: get rid of the Observers throughout the code base.
This is a big patch that will remove all references to the observers
throughout the code, including a complete removal of the Observer-related
messages' source files.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We reworked the code a bit to accommodate the introduction of the log
monitor's publish/subscribe mechanisms. With this patch we no longer
depend on the observers, and instead use the much broader approach of
subscribing to events. In our case, we will subscribe to log levels.
If the '-w'/'--watch' flag is defined, the tool will be subscribed to the
'log-info' level by default, unless one of the following flags is defined
(in which case the level will be changed accordingly): '--watch-debug',
'--watch-info', '--watch-sec', '--watch-warn' and '--watch-error'.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
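A rough sketch of the flag-to-subscription mapping described above, in Python rather than the tool's own code. Two points are assumptions of this example and not stated in the message: a level-specific flag is treated as enabling watch mode on its own, and if several level flags are given the first one wins.

```python
LEVEL_FLAGS = {
    "--watch-debug": "log-debug",
    "--watch-info": "log-info",
    "--watch-sec": "log-sec",
    "--watch-warn": "log-warn",
    "--watch-error": "log-error",
}

def watch_subscription(flags):
    """Return the log subscription name for the given flags, or None."""
    for flag in flags:
        if flag in LEVEL_FLAGS:
            return LEVEL_FLAGS[flag]  # assumption: first level flag wins
    if "-w" in flags or "--watch" in flags:
        return "log-info"  # default level when only -w/--watch is given
    return None
```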
mon: Add publish/subscribe capabilities to the log monitor and status cmd.
This patch allows us to steer away from the monitor's observer mechanism,
by using instead the already existing publish/subscribe mechanism.
We follow the log levels used by the log monitor, and will recognize any
one of the following subscriptions: 'log-error' (highest priority),
'log-warn', 'log-sec', 'log-info' and 'log-debug' (lowest priority).
Also, add a new 'status' command to the monitor, which may be invoked by
any client (such as the ceph tool), and which shall return the status of
the various cluster components (osdmap, pgmap, ...).
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Josh Durgin [Wed, 16 May 2012 19:41:27 +0000 (12:41 -0700)]
librbd: check for cache flush errors
Return errors from flushing to the caller. Warn
if an error occurs during invalidation, but don't retry,
since the higher level handles these cases, namely:
* rollback (doing this with an image open is asking for trouble)
* shrink (doing this with writes in flight may create extra objects anyway)
* shutdown (qemu flushes before closing the device)
Josh Durgin [Tue, 15 May 2012 22:21:50 +0000 (15:21 -0700)]
ObjectCacher: handle write errors
If a write error occurs, mark the BufferHead dirty again, and
pass the return value to the completion. This makes flushing
return the write error, if one occurs, since the flush callback
is passed as the write callback.
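A simplified sketch of the behaviour described above. The class and function here are stand-ins written for illustration, not the ObjectCacher API, and the state names are reduced to plain strings.

```python
class BufferHead:
    def __init__(self):
        self.state = "tx"  # a write is in flight

def finish_write(bh, ret, on_complete):
    """On failure, mark the buffer dirty again so it can be rewritten;
    either way, hand the return value to the completion."""
    bh.state = "dirty" if ret < 0 else "clean"
    on_complete(ret)

# The flush callback is passed as the write callback, so it sees the error.
flush_results = []
bh = BufferHead()
finish_write(bh, -5, flush_results.append)  # -5 stands in for an errno-style failure
```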
Josh Durgin [Tue, 15 May 2012 17:58:59 +0000 (10:58 -0700)]
ObjectCacher: propagate read errors to the caller
Previously the return value of a read operation was ignored. Now a
read error sets the error field, and changes the BufferHead to a new
error state. Error state BufferHeads are treated as misses so they can
be retried when requested by a user of the ObjectCacher. When _readx
is called again internally, they're treated as hits so the error can
be returned to the user.
The error value is ignored if the BufferHead is not in the error
state.
Sage Weil [Wed, 16 May 2012 22:37:34 +0000 (15:37 -0700)]
mon: fix mon removal check
Only take our absence from the monmap to mean that we were removed if we
were ever a member in the first place.
This fixes the bootstrap case:
- create temp_monmap with existing member(s) plus new guy
- ceph-mon --mkfs --monmap temp_monmap --fsid ...
- start ceph-mon
Basically, this is just using the seed monmap as a way to tell the new
daemon which ip:port to use. Specifying mon addr, public network, or
public addr would also work.
Fixes: #2436
Signed-off-by: Sage Weil <sage@inktank.com>
Josh Durgin [Wed, 16 May 2012 20:40:43 +0000 (13:40 -0700)]
ObjectCacher: only perfcount reads requested by the client
_readx is called again after each bh is read by C_RetryRead. This
resulted in the read being counted many times for the internal
caller that was just checking whether it was done yet.
Sage Weil [Tue, 15 May 2012 04:01:58 +0000 (21:01 -0700)]
monmap: use feature bits and single encode() method
Instead of selecting an encode method in the caller, use a normal features
argument to encode() and branch there.
Leave behavior of all callers untouched. We continue to assume, for
example, that all monitors have the same features, and that
'ceph mon getmap' should return the fully-featured encoding.
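The shape of the change, sketched in Python. Everything concrete here is hypothetical: the feature bit name, the JSON payloads, and the two formats exist only to show a single encode function branching on a features argument.

```python
import json

# Hypothetical feature bit, named for this sketch only.
FEATURE_NAMED_MONS = 1 << 0

def encode_monmap(monmap, features):
    """Single entry point: the feature bits select the wire format."""
    if features & FEATURE_NAMED_MONS:
        return json.dumps({"v": 2, "mons": monmap}, sort_keys=True)
    # Legacy format for peers that lack the feature.
    return json.dumps({"v": 1, "addrs": sorted(monmap.values())})

monmap = {"a": "10.0.0.1:6789", "b": "10.0.0.2:6789"}
full = encode_monmap(monmap, FEATURE_NAMED_MONS)
legacy = encode_monmap(monmap, 0)
```

Callers pass the features they assume the reader has instead of choosing between separate encode methods.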
Josh Durgin [Mon, 14 May 2012 18:49:49 +0000 (11:49 -0700)]
Objecter: don't throttle resent linger ops
Throttling is intended to stop the caller from submitting too many
requests, not to block requests that are being resent internally. This
prevents a deadlock when handling an osdmap - previously
handle_osd_map could block when resending linger ops due to the
throttling. This would stop the messenger's dispatch thread from
delivering any subsequent messages, so the throttle budget would never
be replenished.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
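A toy model of the deadlock and the fix, in Python. The `Throttle` class and `send_op` function are hypothetical; a real throttle would block the calling thread, which this sketch represents by raising an exception instead.

```python
class Throttle:
    def __init__(self, budget):
        self.available = budget

    def would_block(self):
        return self.available == 0

    def take(self):
        if self.would_block():
            # In the real code the caller would block here.
            raise RuntimeError("caller would block")
        self.available -= 1

def send_op(throttle, resent):
    """Only new submissions consume throttle budget; resends skip it."""
    if not resent:
        throttle.take()
    return "sent"

throttle = Throttle(budget=1)
send_op(throttle, resent=False)                # uses up the whole budget
resend_result = send_op(throttle, resent=True) # still goes through
```

If the resend took budget as well, the dispatch thread would wait on a budget that only later messages, delivered by that same thread, could replenish.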
Sage Weil [Tue, 8 May 2012 23:30:26 +0000 (16:30 -0700)]
mon: use external keyring for mon->mon auth
- Feed our keyring into the auth methods.
- Do not fail to build a ticket for type MON when we don't have a cap; it
won't be in the auth database. Also, we don't have caps on the monitors
that are enforced between each other.
Sage Weil [Tue, 15 May 2012 03:13:40 +0000 (20:13 -0700)]
mon: keep mon. secret in an external keyring
- Keep the mon. key in a separate keyring file, "keyring", in the mon
data dir.
- During init, if we don't find that file, copy the key from the keyserver
database.
- During mkfs, put the mon. key in that file, and remove it from the seed
file that primes the auth database.
This will allow admins to change the mon. key without bringing the cluster
online and doing something wonky.
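The init-time fallback can be sketched as below. This is an illustration in Python under stated assumptions: the keyserver database is modelled as a plain dict, the key is an opaque string, and `load_mon_key` is a hypothetical helper. Only the file name "keyring" comes from the description above.

```python
from pathlib import Path
import tempfile

def load_mon_key(mon_data_dir, keyserver_db):
    """Prefer the external keyring file; if it is missing, seed it
    from the keyserver database."""
    keyring = Path(mon_data_dir) / "keyring"
    if not keyring.exists():
        keyring.write_text(keyserver_db["mon."])
    return keyring.read_text()

with tempfile.TemporaryDirectory() as mon_data:
    db = {"mon.": "old-secret"}
    first = load_mon_key(mon_data, db)       # file missing: copied from db
    # An admin edits the file directly while the cluster is offline.
    (Path(mon_data) / "keyring").write_text("new-secret")
    second = load_mon_key(mon_data, db)      # file present: db is ignored
```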