mon: mkfs compatset may be different from runtime compatset
When we create a monitor we set a given number of compat features on
disk to clearly state the features a given monitor supports -- mostly to
break backward compatibility when such compatibility cannot be
guaranteed.
However, we may wish to toggle some features during runtime; e.g., wait
for all the monitors in the quorum to support a given feature before
flipping a switch and state that all monitors now require feature X.
We are already flipping those switches during runtime, but we weren't
allowing the monitor to set a subset of those features during mkfs.
While the initial approach worked fine with clusters being upgraded and
fresh clusters, it could become weird in a mixed-version environment.
Backport: emperor,firefly,giant
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
John Spray [Fri, 2 Jan 2015 17:48:25 +0000 (17:48 +0000)]
tools: create cephfs-table-tool
It was unnatural to shoehorn resetting tables
into the journaltool. This new tool initially
can simply dump or reset the session/snap/ino
tables, and would also be a place for any
more complex operations in future.
John Spray [Fri, 16 Jan 2015 00:00:56 +0000 (00:00 +0000)]
mds: abstract SessionMapStore from SessionMap
This is similar to what I did for InodeStore a while back:
introduce a logical separation between the persisted attributers
(and their encoding) and the live/runtime behavioural code. This
results in a handy SessionMapStore class that can be used for
encode/decode from tools.
Also give it a reset_state method so that it matches the
prototype of the MDSTable subclasses for the benefit of
cephfs-table-tool.
We no longer convert stores on upgrade. Users coming from bobtail or
before sould go through an interim version such as cuttlefish, dumpling,
firefly or giant.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
ceph_mon: no longer attempt store conversion on start
People upgrading from bobtail or previous clusters should first go
through an interim version (quite a few to pick from: cuttlefish,
dumpling, firefly, giant).
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Loic Dachary [Thu, 15 Jan 2015 12:23:47 +0000 (13:23 +0100)]
tests: adapt to new json-pretty format
The json-pretty format was modified for readability and now includes
additional newlines / spaces. Either switch to json to avoid dealing
with space changes or modify the expected output to include them.
Loic Dachary [Thu, 15 Jan 2015 11:28:12 +0000 (12:28 +0100)]
common: restore format fallback semantic
When Formatter::create replaced new_formatter, the handling of an
invalid format was also incorrectly changed. When an invalid format (for
instance "plain") was specified, new_formatter returned a NULL pointer
which was sometime handled by creating a json-pretty formatter and
sometimes differently.
A new Formatter::create prototype with a fallback argument is added and
is used if it is not the empty string and that the format is not
known. This prototype is used where new_formatter returning NULL was
replaced by a json-pretty formatter.
Josh Durgin [Wed, 14 Jan 2015 23:01:38 +0000 (15:01 -0800)]
qa: ignore duplicates in rados ls
These can happen with split or with state changes due to reordering
results within the hash range requested. It's easy enough to filter
them out at this stage.
John Spray [Wed, 14 Jan 2015 10:35:53 +0000 (10:35 +0000)]
mds: handle heartbeat_reset during shutdown
Because any thread might grab mds_lock and call heartbeat_reset
immediately after a call to suicide() completes, this needs
to be handled as a special case where we tolerate MDS::hb having
already been destroyed.
Fixes: #10382 Signed-off-by: John Spray <john.spray@redhat.com>
Sage Weil [Tue, 13 Jan 2015 22:51:04 +0000 (14:51 -0800)]
mon: accumulate a single pending transaction and propose it all at once
Previous we would queue lots of distinct encoded Transactions from various
callers, usually one per PaxosService. These would be sent through paxos
one at a time.
If there is a completed transaction there is no reason to delay; it is
more efficient to push it through immediately. Since we will propose
anything pending right when we finish, there is minimal opportunity for
other work to get done.
Instead, accumulate everything in a single MonitorDBStore::Transaction and
propose all pending changes all at once. Encode at propose time and
expose the Transaction to the callers so they can add their changes.
Jason Dillaman [Tue, 13 Jan 2015 04:17:50 +0000 (23:17 -0500)]
librbd: flush pending AIO requests under all existing flush scenarios
AIO requests that are waiting on the image lock should be flushed
during all existing RBD flush scenarios. A few flush cases were
missed in the original implementation.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 3 Nov 2014 21:51:06 +0000 (16:51 -0500)]
librbd: differentiate between R/O vs R/W RBD features
The new RBD exclusive lock feature should be treated as a
feature that is only applied when the image is opened in
R/W mode.
Older clients will need to handle the updated
cls_rbd::get_features method in order to properly determine
the incompatible features for an image depending on the
current mode.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Sun, 16 Nov 2014 19:20:42 +0000 (14:20 -0500)]
librbd: Add convenience library to support unit tests
Unit tests need access to the private symbols of librbd no
longer exported from librbd.so. A new librbd_internal
convenience library was created to allow access.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Wed, 8 Oct 2014 12:41:53 +0000 (08:41 -0400)]
librbd: Integrate librbd with new exclusive lock feature
Operations that update the image now require the exclusive lock
if the feature is enabled. AIO write and discard operations will
automatically request the exclusive lock from the current leader
to support live-migration.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Matt Richards [Tue, 13 Jan 2015 00:59:42 +0000 (16:59 -0800)]
librados: bump rados version number
As a follow-on to 49d114f1fff90e5c0f206725a5eb82c0ba329376,
increment the "extra" version field so clients can easily
determine if they have a version of librados that properly
translates C API operation flags.
Signed-off-by: Matthew Richards <mattjrichards@gmail.com>
Sage Weil [Mon, 12 Jan 2015 22:00:21 +0000 (14:00 -0800)]
osd: enable filestore_extsize by default
Note that this will only get used if the kernel is new enough; if it is
older than 3.5 the option will get disabled and extsize will not be used
even if the option is set to true.
Sage Weil [Mon, 12 Jan 2015 21:59:39 +0000 (13:59 -0800)]
os/FileStore: verify kernel is new enough before using extsize ioctl
Old kernels have an XFS bug that exposes uninitialized data when the
extsize hint is set and only partially written. This is fixed by Linux
commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d, documented in XFS bug
http://oss.sgi.com/bugzilla/show_bug.cgi?id=874, and tested by XFS
test xfs/229 to prevent regressions.
Notably the original bug affects kernel 3.2, which is widely deployed with
ubuntu precise 12.04.
Backport: giant, firefly Signed-off-by: Sage Weil <sage@redhat.com>
John Spray [Mon, 5 Jan 2015 19:34:57 +0000 (19:34 +0000)]
mon: implement `fs reset`
This is for use in CephFS disaster recovery. When
the metadata pool has been forcibly reset to a single-MDS
metadata tree, we would like to reset the MDSMap to match.