We are converting the monitor subsystem to a Single-Paxos architecture,
backed by a key/value store. The previous architecture used a Paxos
instance for each Paxos Service, backed by a nasty Monitor Store that
provided few to no consistency guarantees whatsoever, which led to a fair
number of workarounds.
Changes:
* Paxos:
- Add k/v store support
- Add documentation describing the new Paxos storage layout and behavior
- Get rid of the stashing code, which was used as a consistency point
mechanism (we no longer need it, because of our k/v store)
- Debug level of 30 will output json-formatted transaction dumps
- Allows for proposal queueing, to be proposed in the same order as
they were queued.
- Remove the 'is_leader()' function; use the Monitor's instead, for
simplicity.
- Add 'is_lease_valid()' function.
- Disregard 'stashed versions'
- Make the paxos 'state' variable a bit-map, so we lock the proposal
mechanism while maintaining the state [5].
- Related notes: [3]
* PaxosService:
- Add k/v store support, creating wrappers to be used by the services
- Add documentation
- Support single-paxos behavior, creating wrappers to be used by the
services and service-specific version
- Rearrange variables so they are neatly organized in the beginning of
the class
- Add a trim_to() function to be used by the services, instead of letting
them rely on Paxos::trim_to(), which is no longer adequate to the job
at hand
- Debug level of 30 will output json-formatted transaction dumps
- Support proposal queueing, taking it into consideration when
assessing the current state of the service (active, writeable,
readable, ...)
- Redefine the conditions for 'is_{active,readable,writeable}()' given
the new single-paxos approach, with proposal queueing [1].
- Use our own waiting_for_* callback lists, which now must be
dissociated from their Paxos counterparts [2].
- Related notes: [3], [4]
* Monitor:
- Add k/v store support
- Use only one Paxos instance and pass it down to each service instance
- Crank up CEPH_MON_PROTOCOL to 10
* {Auth,Log,MDS,Monmap,OSD,PG}Monitor:
- Add k/v store support
- Add single-paxos support
* AuthMonitor:
- Don't always propose full versions: if the KeyServer doesn't have
keys, we cannot propose a full version. This should only happen when
we start with a brand new store and we are creating the first
pending proposal, and if we were to commit a full version filled
with nothing but a big void of nothingness, we could eventually end
up with a corrupted version.
* Elector:
- Add k/v store support
- Add single-paxos support
* ceph-mon:
- Use the monitor's k/v store instead of MonitorStore
* MMonPaxos:
- remove the machine_id field: This field was used to identify from/to
which paxos service a given message belonged. We no longer have a Paxos
for each service, so this field became obsolete.
Notes:
[1] Redefine the conditions for 'is_{active,readable,writeable}()' on
the PaxosService class, to be used with single-paxos and proposal
queueing:
We should not rely on the Paxos::is_*() functions, since they do not apply
directly to the PaxosService.
All the PaxosService classes share the same Paxos class, but they do not
rely on its values. Each service relies on, uses, and updates only its own
values on the k/v store. Thus, we may have a given service (e.g., the
OSDMonitor) proposing a new value, hence updating or waiting to update its
store, and we may still consider the LogMonitor as being able to read and
write its own values on the k/v store. In a nutshell, different services
do not overlap on their access to their own store when it comes to reading,
and since the Paxos will queue their updates and deal with them in a FIFO
order, their updates won't overlap either.
Therefore, the conditions for the PaxosService::is_{active,readable,
writeable} differ from those on the Paxos::is_{active,readable,writeable}.
* PaxosService::is_active() - the PaxosService will be considered as
active iff it is not proposing and the Paxos is not recovering. This
means that a given PaxosService (e.g., the OSDMonitor) may be considered
as being active even though some other service (e.g., the LogMonitor) is
proposing a new value and the Paxos is on the UPDATING state. This means
that the OSDMonitor will be able to read its own versions and queue any
changes on to the Paxos. However, if the Paxos is on state RECOVERING,
we cannot be considered as active.
* PaxosService::is_writeable() - We will be able to propose new values
iff we are the Leader, we have a valid lease, and we are not already
proposing. If we are proposing, we must wait for our proposal to finish
in order to proceed with writing to our k/v store; otherwise we could
end up assuming that our last committed version was, say, 10, then
assign map epochs/versions based on that, make changes to the store
based on those values, and end up overwriting previously proposed
values on the store. We really don't want that. To be fair,
there was a chance we could assume we were always writable, but there
may be unforeseen consequences to this; so we take the conservative
approach here for now, and we will relax it in the future if we believe
it to be fruitful.
* PaxosService::is_readable() - We will be readable iff we are not
proposing and the Paxos is not recovering; if our last committed version
exists; and if we are either a cluster of one or we have a valid lease.
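The three conditions above can be sketched as plain predicates. This is a minimal illustration, not the monitor's code: the `PaxosView` and `ServiceView` classes and all field names are hypothetical, and "our last committed version exists" is modelled here as `last_committed > 0`, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PaxosView:
    """Hypothetical snapshot of the shared Paxos state a service consults."""
    recovering: bool = False
    lease_valid: bool = True


@dataclass
class ServiceView:
    """Hypothetical per-service state; field names are illustrative only."""
    proposing: bool = False
    is_leader: bool = False
    last_committed: int = 0  # assumption: 0 means "no committed version yet"
    quorum_size: int = 1


def is_active(svc, paxos):
    # Active iff this service is not proposing and Paxos is not recovering.
    # Another service proposing (Paxos UPDATING) does not matter here.
    return not svc.proposing and not paxos.recovering


def is_writeable(svc, paxos):
    # Leader, valid lease, and no proposal of our own in flight.
    return svc.is_leader and paxos.lease_valid and not svc.proposing


def is_readable(svc, paxos):
    # Not proposing, Paxos not recovering, a committed version exists,
    # and either we are a cluster of one or we hold a valid lease.
    return (not svc.proposing
            and not paxos.recovering
            and svc.last_committed > 0
            and (svc.quorum_size == 1 or paxos.lease_valid))
```

Note that none of the predicates looks at whether some other service is proposing, which is the point of decoupling them from the Paxos::is_*() functions.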
[2] Use own waiting_for_* callback lists on PaxosService, which now must
be dissociated from their Paxos counterparts:
We were relying on Paxos to wait for state changes, but since our state
became somewhat independent from the Paxos state, we have to deal with
callbacks waiting for 'readable', 'writable' or 'active' on different
terms than those that Paxos provides.
So, basically, we will take one of two approaches when it comes to waiting:
* If we are proposing, queue ourselves on our own list, waiting for the
proposal to finish;
* Otherwise, the cause for the need to wait comes from Paxos, so queue
the callback directly on Paxos.
This approach means that we must make sure to check our desired state
whenever the callback is fired up, and re-queue ourselves if the state
didn't quite change (or if it changed but our waiting condition result
didn't). For instance, if we were waiting for a proposal to finish due to
a failed 'is_active()', we will need to recheck if we are active before
continuing once the callback is fired. This is mainly because we may have
finished our proposal, but a new Election may have been called and the
Paxos may not be active.
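A rough sketch of the two waiting paths and the re-check on wake-up follows. The class and attribute names (`Paxos`, `Service`, `waiting_for_finished_proposal`, `fire`) are hypothetical stand-ins chosen for the example, not the actual interfaces.

```python
class Paxos:
    """Hypothetical stand-in holding only what the example needs."""

    def __init__(self):
        self.active = True
        self.waiting_for_active = []


class Service:
    """Hypothetical PaxosService-like object with its own waiter list."""

    def __init__(self, paxos):
        self.paxos = paxos
        self.proposing = False
        self.waiting_for_finished_proposal = []

    def is_active(self):
        return not self.proposing and self.paxos.active

    def wait_for_active(self, callback):
        """Run callback if active; otherwise queue a retry on the right list."""
        if self.is_active():
            callback()
            return

        # The retry re-checks the condition, since the state may have
        # changed without the waiting condition becoming true.
        def retry():
            self.wait_for_active(callback)

        if self.proposing:
            # Our own proposal is the reason to wait.
            self.waiting_for_finished_proposal.append(retry)
        else:
            # The reason to wait comes from Paxos itself.
            self.paxos.waiting_for_active.append(retry)


def fire(waiters):
    """Drain a waiter list in place, invoking each queued callback once."""
    pending = list(waiters)
    waiters[:] = []
    for cb in pending:
        cb()
```

In this sketch, a callback queued while proposing is re-queued on the Paxos list if an election made Paxos inactive by the time the proposal finished.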
[3] Propose everything in the queue before bootstrapping, but don't
allow new proposals:
The MonmapMonitor may issue bootstraps once it is updated. We must ensure
that we propose every single pending proposal before we actually do it.
However, we don't want to accept new proposals if we are going to bootstrap;
otherwise, we may end up losing them.
[4] Handle the case when first_committed_version equals 0 on a
PaxosService
In a nutshell, the services do not set the first committed version, as
they consider it as a SEP (Somebody Else's Problem). They do rely on it
though, and we, the PaxosService, must ensure that it contains a valid
value (that is, higher than zero) at all times.
Since we will only have a first_committed version equal to zero once,
and that is before the service's first proposal, we are safe to simply
read the variable from the store and assign the first_committed the same
value as the last_committed iff the first_committed version is zero.
This also affects trimming, since trimming relies on the first_committed
version as the lower bound for version trimming. Even though the k/v store
will gracefully ignore any problem from trying to remove non-existent
versions, the main issue would still stand: we'd be removing a non-existent
version and that just doesn't make any sense.
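The fix-up described in this note could look roughly like the following. The store is modelled as a plain dict keyed by `(prefix, name)` tuples, and the function names are hypothetical; the trimming range `[first, trim_to)` is likewise an assumption made for illustration.

```python
def load_versions(store, prefix):
    """Return (first_committed, last_committed) for one service.

    `store` is a hypothetical dict keyed by (prefix, name); missing
    keys read as 0, mimicking a brand new store.
    """
    first = store.get((prefix, "first_committed"), 0)
    last = store.get((prefix, "last_committed"), 0)
    if first == 0:
        # Services never set first_committed themselves, so adopt
        # last_committed the one time we find it unset.
        first = last
    return first, last


def versions_to_trim(first_committed, trim_to):
    """Versions in [first_committed, trim_to); nothing if first is unset."""
    if first_committed == 0:
        return []
    return list(range(first_committed, trim_to))
```

With first_committed guaranteed to be non-zero once anything has been committed, the lower bound used for trimming always names a version that exists.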
[5] 'lock' paxos when we are running some internal proposals
Force the paxos services to wait for us to complete whatever we are
doing before they can proceed. This is required because on certain
occasions we might need to run internal proposals, not associated with any of
the paxos services (for instance, when learning an old value), and we need
them to stay put, or they might end up in an erroneous state and crash the
monitor.
This could have been done with an extra bool, but there was no point
in creating a new variable when we can just as easily reuse the
'state' variable for our twisted interests.
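As an illustration of reusing the state variable as a bit-map, the sketch below adds a lock bit alongside the existing states. The flag names and values are assumptions made for the example, not the values used in the code.

```python
from enum import IntFlag


class State(IntFlag):
    # Flag names and values are illustrative only.
    RECOVERING = 1
    ACTIVE = 2
    UPDATING = 4
    LOCKED = 8


def lock(state):
    """Set the lock bit while keeping the underlying state."""
    return state | State.LOCKED


def unlock(state):
    """Clear the lock bit, leaving the underlying state untouched."""
    return state & ~State.LOCKED


def can_propose(state):
    """Services may propose only when Paxos is active and not locked."""
    return bool(state & State.ACTIVE) and not (state & State.LOCKED)
```

Because the lock is just another bit, the state that was in effect before locking is still there after unlocking.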
Fixes: #4175
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: Remove global version code introduced around bobtail's release
This patch reverts most of the global version (gv) related patches that
were introduced around bobtail's release as a prelude to the single-paxos
patches.
The gv infrastructure allowed us to gather version information on the
monitors, essential to the move to a single-paxos implementation on
existing clusters -- this means that for an existing cluster to upgrade
to a single-paxos monitor, it will first have to be upgraded to a
version prior to this patch. This patch strips the monitor subsystem of
all the gv-related code that is of no use for upcoming versions.
Furthermore, from this patch onwards until all single-paxos patches
are merged, ceph-mon won't work as expected, and may not compile at some
point in the git history.
These patches are not retro-compatible, and the monitors are not expected
to work with earlier versions.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Mon, 11 Feb 2013 14:23:54 +0000 (06:23 -0800)]
osd: update snap collections for sub_op_modify log records conditionally
The only remaining caller is sub_op_modify(). If we do have a non-empty
op transaction, we want to do this update, regardless of what we think
last_backfill is (our notion may be not completely in sync with the
primary). In particular, our last_backfill may be the same object but
a different snapid, but the primary disagrees and is pushing an op
transaction through.
Instead, update the collections if we have a non-empty transaction.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Mon, 11 Feb 2013 00:59:48 +0000 (16:59 -0800)]
osd: unconditionally encode snaps buffer
Previously we would only encode the updated snaps vector for CLONE ops.
This doesn't work for MODIFY ops generated by the snap trimmer, which
may also adjust the clone collections. It is also possible that other
operations may need to populate this field in the future (e.g.,
LOST_REVERT may, although it currently does not).
Fixes: #4071, and possibly #4051.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Yehuda Sadeh [Fri, 8 Feb 2013 21:14:49 +0000 (13:14 -0800)]
rgw: change json formatting for swift list container
Fixes: #4048
There is some difference in the way swift formats the
xml output and the json output for list container. In
xml the entity is named 'name' and in json it is named
'subdir'.
Sage Weil [Sat, 9 Feb 2013 08:05:33 +0000 (00:05 -0800)]
osd: fix load_pgs collection handling
On a _TEMP pg, is_pg() would succeed, which meant we weren't actually
hitting the cleanup checks. Instead, restructure this loop as positive
checks and handle each type of collection we understand.
This fixes _TEMP cleanup.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Sat, 9 Feb 2013 08:04:29 +0000 (00:04 -0800)]
osd: fix load_pgs handling of pg dirs without a head
If there is a pgid that passes coll_t::is_pg() but there is no head, we
will populate the pgs map but then fail later when we try to do
read_state. This is a side-effect of 55f8579.
Take explicit note of _head collections we see, and then warn when we
find stray snap collections.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 8 Feb 2013 06:06:14 +0000 (22:06 -0800)]
mon: handle -EAGAIN in completion contexts
We can get ECANCELED, EAGAIN, or success out of the completion contexts,
but in the EAGAIN case (meaning there was an election) we were sending
a success to the client. This resulted in client hangs and all-around
confusion when the monitor cluster was thrashing.
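A minimal sketch of telling the three outcomes apart follows. The function name and the action labels are hypothetical, and what to do on EAGAIN (retrying here) is an assumption; the message above only establishes that EAGAIN must not be reported to the client as success.

```python
import errno


def reply_for(result):
    """Map a completion result (0 or a negative errno) to an action label."""
    if result == 0:
        return "ack"    # the only case where the client sees success
    if result == -errno.EAGAIN:
        return "retry"  # an election happened: do not ack (assumed: retry)
    if result == -errno.ECANCELED:
        return "drop"   # assumed handling for a cancelled completion
    raise ValueError(f"unexpected completion result: {result}")
```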
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
Samuel Just [Thu, 7 Feb 2013 19:53:28 +0000 (11:53 -0800)]
ReplicatedPG: check store for temp collection in have_temp_coll
We may not have "created" the temp collection since OSD restart
before removing the PG. have_temp_coll must also look at the
OSD store. Currently, the only user is pg removal, so the
extra work is acceptable.
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Yehuda Sadeh [Thu, 7 Feb 2013 00:43:48 +0000 (16:43 -0800)]
rgw: bucket recreation should not clobber bucket info
Fixes: #4039
The user's list of buckets is modified even if the bucket already
exists. This fix removes the newly created directory object, and
makes sure that user info's data points at the correct bucket.
Sage Weil [Thu, 7 Feb 2013 18:21:49 +0000 (10:21 -0800)]
osd: flush peering queue (consume maps) prior to boot
If the osd itself is behind on many maps during boot, it will get more and
(as part of that) flush the peering wq to ensure the pgs consume them.
However, it is possible for the OSD to have the latest/recent maps, but pgs
to be behind, and to jump directly to boot and join. The OSD is then laggy and
unresponsive because the peering wq is way behind.
To avoid this, call consume_map() (kick the peering wq) at the end of
init and flush it to ensure we are *internally* all caught up before we
consider joining the cluster.
I'm pretty sure this is the root cause of #3905 and possibly #3995.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Yehuda Sadeh [Thu, 13 Dec 2012 23:52:34 +0000 (15:52 -0800)]
rgw: key indexes are only link to user info
Instead of keeping multiple copies of the user info,
we just treat the key index as a pointer to the actual
user info (indexed by uid). This helps with two issues:
first, it scales better as we don't need to update the
entire set of keys whenever we make any change. Second,
it helps with the uid index atomicity.
One point to keep in mind is that both the links and the
info can be cached, so effect on performance is minimal.
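The indirection can be pictured with two small maps. This is only a sketch of the idea: the `UserStore` class, its method names, and the dict-based storage are hypothetical and say nothing about how rgw lays out its objects.

```python
class UserStore:
    """Hypothetical in-memory model: key index entries link to a uid."""

    def __init__(self):
        self.info_by_uid = {}  # uid -> the single copy of the user info
        self.uid_by_key = {}   # access key -> uid (a link, not a copy)

    def put_user(self, uid, info, access_keys):
        self.info_by_uid[uid] = info
        for key in access_keys:
            self.uid_by_key[key] = uid

    def update_info(self, uid, **changes):
        # One write, however many keys point at this user.
        self.info_by_uid[uid].update(changes)

    def get_by_key(self, access_key):
        # Two lookups: follow the link, then read the info.
        return self.info_by_uid[self.uid_by_key[access_key]]
```

A change to the user info is visible through every key at once, since no key holds a copy of its own.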
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: caleb miles <caleb.miles@inktank.com>
Dan Mick [Thu, 31 Jan 2013 01:33:09 +0000 (17:33 -0800)]
Validate format strings for CLS_ERR/CLS_LOG
cls_log needed __attribute__((format(printf..)) to allow the compiler
to crosscheck format strings and arguments. After adding that, there
needed to be a bunch of fixups for %ll, and a few changes for missing
arguments, etc. uncovered by the checking.
Fixes: #3970
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Alex Elder [Thu, 31 Jan 2013 12:47:59 +0000 (06:47 -0600)]
qa: update the rbd/concurrent.sh workunit
A few changes, now that a few rbd problems have been fixed.
First, the more substantive changes:
- Generate a source file, and compare what's read back from rbd
devices with the content of that file.
- Write to the rbd device such that the written data spans
an (assumed 4 MB) rbd object boundary, as well as starting
and ending on non-page-aligned offsets.
- Perform multiple reads on rbd devices: entirely within a range
before any written data; beginning before but ending within
written data; the exact written data (and validating what's
read); beginning within written data but ending after it;
reading after written data but within a written rbd object;
and reading from an unwritten rbd object.
- Have the sleep between iterations provide a non-integer value
to avoid zero (or quantized) delays.
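To make the write placement concrete, here is one way to pick an offset and length that span an object boundary without being page-aligned. The workunit itself is a shell script; this Python sketch only illustrates the arithmetic. The 4096-byte page size, the helper names, and the default byte counts are assumptions.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4 MB rbd object size
PAGE_SIZE = 4096               # assumed page size


def spanning_write(object_index=1, before=1000, after=2000):
    """Return (offset, length) straddling the start of object `object_index`.

    `before` and `after` are the byte counts on each side of the
    boundary; pick values that are not multiples of PAGE_SIZE.
    """
    boundary = object_index * OBJECT_SIZE
    return boundary - before, before + after


def spans_boundary(offset, length):
    """True if the first and last written bytes fall in different objects."""
    return offset // OBJECT_SIZE != (offset + length - 1) // OBJECT_SIZE
```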
Also, some slightly less substantive (but possibly informative) changes:
- Don't run with "set -x". It produces a ton of noise that is
not useful for this test. This is an exerciser, looking
really for system crashes during concurrent activity, and
knowing which commands were (concurrently) active isn't going
to help much in diagnosis.
- Create two more directories, used to track the degree of
concurrency (more or less) and the highest rbd id consumed.
Files whose names are numbers are touched in each, and the
highest at the end is the highest during the run. This gets
around issues passing environment info from sub-shells to the
top-level shell. As a bonus, it offers a better chance of
avoiding problems due to concurrent update.
- NAMESDIR is renamed NAMES_DIR, and it (and the others) is
set up in the setup() function.
- Increase the concurrency and iteration counts.
- Move the default definitions before the ceph secrets stuff
Danny Al-Gaaf [Wed, 30 Jan 2013 17:52:24 +0000 (18:52 +0100)]
PGMap: fix -Wsign-compare warning
Fix -Wsign-compare compiler warning:
mon/PGMap.cc: In member function 'void PGMap::apply_incremental
(CephContext*, const PGMap::Incremental&)':
mon/PGMap.cc:247:30: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Dan Mick [Wed, 30 Jan 2013 02:41:20 +0000 (18:41 -0800)]
mds/Server.cc: fix warring assert.h's
New include boost/lexical_cast.hpp apparently drags in the system
assert.h on quantal and squeeze at least, breaking our careful
assert.h; re-include our file to fix it back
Fixes: #3957
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Dan Mick [Tue, 29 Jan 2013 23:18:53 +0000 (15:18 -0800)]
init-ceph: make ulimit -n be part of daemon command
ulimit -n from 'max open files' was being set only on the machine
running /etc/init.d/ceph. It needs to be added to the commands to
start the daemons, and run both locally and remotely.
Verified by examining /proc/<pid>/limits on local and remote hosts
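The idea of making the limit part of the command itself, so that it applies wherever the command runs, can be sketched as below. The init script is shell; this is a Python illustration with hypothetical helper names and an example daemon command line, and the use of ssh for the remote case is an assumption.

```python
import shlex


def daemon_command(cmd, max_open_files=None):
    """Prefix `cmd` with a ulimit so the limit travels with the command."""
    if not max_open_files:
        return cmd
    return f"ulimit -n {int(max_open_files)}; {cmd}"


def remote_command(host, cmd):
    """Wrap `cmd` for remote execution (ssh is assumed for the example)."""
    return f"ssh {shlex.quote(host)} {shlex.quote(cmd)}"
```

Since the ulimit is inside the quoted command, it takes effect in the shell that starts the daemon, whether that shell is local or on the remote host.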
Fixes: #3900
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Loïc Dachary <loic@dachary.org>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>