Greg Farnum [Fri, 24 Jan 2014 19:07:13 +0000 (11:07 -0800)]
mon: do not use CEPH_FEATURES_ALL for things that touch the disk
We want to encode with our quorum_features instead. Remaining uses of
CEPH_FEATURES_ALL are:
1) when the Elector is sharing its supported features
2) in a MonMap function which is used by monmaptool
3) in the Monitor for winning a standalone election, for ephemeral data,
and for doing mkfs (when we necessarily don't have quorum_features).
4) when doing ceph-mon --inject-monmap (we don't persist the quorum_features
to disk, so we can't use them here).
5) in MMonElection, for doing the default monmap encoding (which is
re-encoded later if the final features don't match CEPH_FEATURES_ALL).
6) as the default encode features for the OSDMap (the monitor always
supplies quorum_features instead).
Greg Farnum [Thu, 23 Jan 2014 22:52:40 +0000 (14:52 -0800)]
Elector: send an OP_NAK MMonElection to old peers who support it
Only new monitors support receiving OP_NAK from a peer without crashing, but
when we add new required features in the future, our monitors can accept
an OP_NAK message which tells them what features they're missing. Then they
will print out an error message and shut down.
(Unfortunately, doing a clean shutdown from here would require a lot of
infrastructure, so we just call exit(0).)
Greg Farnum [Thu, 23 Jan 2014 21:08:26 +0000 (13:08 -0800)]
Elector: ignore messages from mons without required feature capabilities
We maintain a list of required_features which the other monitor's features
must supply. This starts out at 0 and is initialized from the monitor's
list of features whenever we start electing.
Despite the scary sound of "just ignore it", this is safe: the monitor
will only record features as required once a quorum has formed in which
every monitor supports them. After that happens, monitors which do not
support those features will be unable to read the whole mon store/understand
the pg reports/whatever else, so letting them into the quorum would be buggy
behavior.
So if we ignore a monitor, it will not be able to start nor join
an election round with anybody who was in our quorum -- that is, the
ignored monitor cannot form a separate quorum. By ignoring it here, we
also prevent it from endlessly calling elections against the real
quorum.
Unfortunately there is no way to communicate to old monitors that they
cannot join the quorum -- there are no existing messages for that purpose,
and e.g. adding a new op to the MMonElection message would just cause the
old monitor to crash, which we don't want either.
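The required-feature check described in this commit amounts to a bitmask
test. The following is a minimal sketch in Python, not the Elector's actual
C++ code; the feature bit values and the names FEATURE_A, FEATURE_B and
should_ignore are hypothetical and chosen only for illustration.

```python
# Hypothetical feature bits; real Ceph feature values differ.
FEATURE_A = 1 << 0
FEATURE_B = 1 << 1


def should_ignore(peer_features: int, required_features: int) -> bool:
    """Return True if the peer is missing any bit set in required_features."""
    return (peer_features & required_features) != required_features


# required_features starts out at 0, so no peer is ignored at first.
print(should_ignore(FEATURE_A, 0))
# Once both features are recorded as required, a peer with only one is ignored.
print(should_ignore(FEATURE_A, FEATURE_A | FEATURE_B))
```

A peer that supplies extra features beyond the required set still passes,
since only the required bits are compared.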
Greg Farnum [Thu, 23 Jan 2014 02:01:11 +0000 (18:01 -0800)]
Monitor: introduce a function that translates quorum features into disk features
After an election, we call apply_quorum_to_compatset_features(). It translates
from the quorum features into monitor disk state features we care about and
adds them to the disk store. This prevents an older daemon from starting up
on a store it doesn't understand.
While strictly speaking we don't need to add the EC feature until we create
an EC pool, all monitors which speak the new OSDMap encoding also support EC
pools, and having more than one feature makes the pattern clearer.
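The translation step can be pictured roughly as follows. This is a sketch
under stated assumptions, written in Python rather than the monitor's C++:
the bit values, the on-disk feature names, and the helpers
apply_quorum_features and can_start are all hypothetical and do not
reproduce apply_quorum_to_compatset_features() itself.

```python
# Hypothetical wire-feature bits and on-disk feature names (assumptions).
FEATURE_OSDMAP_ENC = 1 << 0
FEATURE_ERASURE_CODE = 1 << 1

QUORUM_TO_DISK = {
    FEATURE_OSDMAP_ENC: "osdmap-enc",
    FEATURE_ERASURE_CODE: "erasure-code",
}


def apply_quorum_features(quorum_features: int, disk_features: set) -> set:
    """Return disk_features plus every disk feature implied by the quorum."""
    updated = set(disk_features)
    for bit, name in QUORUM_TO_DISK.items():
        if quorum_features & bit:
            updated.add(name)
    return updated


def can_start(supported: set, disk_features: set) -> bool:
    """A daemon may open the store only if it understands every recorded feature."""
    return disk_features <= supported


store = apply_quorum_features(FEATURE_OSDMAP_ENC | FEATURE_ERASURE_CODE, set())
print(sorted(store))
print(can_start({"osdmap-enc"}, store))  # an older daemon is refused
```

Recording the features in the store is what lets the startup check refuse
an older daemon before it reads data it cannot decode.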
Sage Weil [Thu, 23 Jan 2014 17:16:54 +0000 (09:16 -0800)]
osd/OSDMap: do not create erasure rule by default
If we do, we will require the v2 feature bit from clients.
We could only include feature bits for rules that are actually referenced
by pools, but for now making the user create the rule is simpler. There is
no need to create this rule ahead of time.
Samuel Just [Sun, 19 Jan 2014 09:17:49 +0000 (01:17 -0800)]
PG: drop messages from down peers
This overlaps with the existing old_peering_msg() mechanism
except in one case: pulls from a replica not in the acting
set. If such a replica gets marked down, we may resend
pulls to another replica without causing a new interval
to start. If we received, but didn't process, a push in
response to such a pull prior to processing the map marking
the peer down, we might process the push after having reset
the pull state for a different pull operation. We can
avoid this by discarding ops from down peers.
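The idea of discarding ops from down peers can be illustrated with a toy
model. This is only a sketch, assuming the map is reduced to a set of up
OSD ids and ops are (sender, payload) pairs; process_ops is a hypothetical
helper, not the PG code.

```python
from collections import deque


def process_ops(queue: deque, up_osds: set) -> list:
    """Process queued ops in order, dropping any whose sender is no longer up."""
    processed = []
    while queue:
        sender, op = queue.popleft()
        if sender not in up_osds:
            # The sender was marked down before we got to this op; drop it
            # so a stale push cannot be matched against a newer pull.
            continue
        processed.append((sender, op))
    return processed


# A push from osd 1 was received but not yet processed when osd 1 went down.
pending = deque([(1, "push"), (2, "push")])
print(process_ops(pending, up_osds={2}))
```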
Samuel Just [Thu, 16 Jan 2014 20:04:01 +0000 (12:04 -0800)]
PG::calc_acting: consider newest_update_osd when choosing backfill peers
We must include newest_update_osd->second.log_tail when considering backfill
peers because in GetLog we will request logs back to the min last_update over
our acting_backfill set. This will result in our log being extended as far
backwards as necessary to pick up any peers which can be log recovered by the
union of newest_update_osd's log and that of the chosen primary.
Samuel Just [Sat, 7 Dec 2013 22:52:49 +0000 (14:52 -0800)]
ReplicatedPG: handle removing the old object in finish_copy_op
do_osd_ops will need to either copy the old version out of the
way or simply delete it depending on mod_desc. Thus, defer
handling filling that part in until we finish the copy op.
Samuel Just [Mon, 9 Dec 2013 03:36:51 +0000 (19:36 -0800)]
PGLog: create interface allowing interface user to cleanup/rollback
We need to be able to allow the PGLog interface user to provide
logic for rolling back and trimming log entries. To that end,
several PGLog methods now take a LogEntryHandler.
In PGLog::merge_old_entry, if prior_version > info.log_tail and
the object is not missing, we must have rolled back the prior
log entry. Thus, we don't skip the entry.
To simplify the code, _merge_old_entry has been split out as
a const helper. This way, proc_replica_log can be reexpressed
as merging the divergent replica log entries with the fully
merged authoritative log.
Samuel Just [Thu, 10 Oct 2013 23:12:10 +0000 (16:12 -0700)]
ReplicatedBackend: implement RPGTransaction
RPGTransaction is essentially a wrapped ObjectStore::Transaction.
The coll_t argument is elided, tempness is instead encoded in the
hobject. RPGTransaction tracks which temp objects are created and
cleared so we can update the ReplicatedBackend tracking and possibly
create the temp collection as needed.
Samuel Just [Thu, 10 Oct 2013 23:10:36 +0000 (16:10 -0700)]
hobject_t/ReplicatedPG: tempness is now an hobject thing
PGBackend implementations will have complete control over the temp
collection. Rather than specifying the collection when sending
ops into the PGBackend, hobjects themselves will be temp or not.
Samuel Just [Thu, 5 Dec 2013 00:06:17 +0000 (16:06 -0800)]
test/osd: restructure Object/RadosModel in prep for append
Attribute handling no longer has special support in ContentsGenerator.
The most recent operation information is now stored in a special
attr rather than at the beginning of the object. ObjectDesc layers
include their own ContentsGenerators to allow more flexibility.
Also, writes truncate to the new object size rather than simply
causing reads to stop at that object size.
Samuel Just [Fri, 22 Nov 2013 19:20:23 +0000 (11:20 -0800)]
OSDMonitor: add debug_fake_ec_pool
This flag will cause ReplicatedPG to act as though the
pool were actually an EC pool in that operations will
be restricted to operations which can be locally rolled
back thereby allowing us to test the ReplicatedPG local
log rollback mechanisms independent of EC. It will also
cause ReplicatedPG to use the async read mechanism on
the PGBackend implementation once it is implemented.
Samuel Just [Sat, 7 Dec 2013 21:19:49 +0000 (13:19 -0800)]
PGLog: don't move up log.tail
Moving up log.tail unnecessarily risks backfilling
a replica after a split. Also, it disrupts the
property that replicas from the most recent interval
which performed writes must have overlapping logs.
Ilya Dryomov [Wed, 22 Jan 2014 15:33:39 +0000 (17:33 +0200)]
MOSDMap: reencode maps if target doesn't have OSDMAP_ENC
Reencode both full and incremental maps if target doesn't know how to
decode OSDMAP_ENC maps (CEPH_FEATURE_OSDMAP_ENC bit is not set). This
fixes a compatibility bug that was introduced in 3d7c69fb0986 ("OSDMap:
add a CEPH_FEATURE_OSDMAP_ENC feature, and use new encoding").
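The re-encoding decision can be sketched like this. It is a simplified
Python model, not the MOSDMap code: the bit value, the string "encodings",
and the helpers encode_map and maps_for_target are assumptions made for
illustration.

```python
FEATURE_OSDMAP_ENC = 1 << 0  # hypothetical bit value


def encode_map(osdmap: dict, features: int) -> str:
    """Encode a map in the new or the old format depending on the features."""
    body = ",".join(f"{k}={v}" for k, v in sorted(osdmap.items()))
    prefix = "new" if features & FEATURE_OSDMAP_ENC else "old"
    return f"{prefix}:{body}"


def maps_for_target(encoded: dict, decoded: dict, target_features: int) -> dict:
    """Return per-epoch encodings the target can decode.

    If the target understands the new encoding, the stored blobs are sent
    unchanged; otherwise every map (full or incremental) is re-encoded.
    """
    if target_features & FEATURE_OSDMAP_ENC:
        return dict(encoded)
    return {epoch: encode_map(decoded[epoch], target_features) for epoch in encoded}


decoded = {5: {"osd.0": "up"}}
encoded = {5: encode_map(decoded[5], FEATURE_OSDMAP_ENC)}
print(maps_for_target(encoded, decoded, 0))
```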
Kai Zhang [Sat, 18 Jan 2014 20:17:10 +0000 (12:17 -0800)]
Missing a key for perm 'w' in permmap (src/pybin/ceph_rest_api.py:277)
It leads to a 500 error when getting mds help info via the REST API.
Changed "w" to "rw" in MonCommands.h
Fixes: #7180
Signed-off-by: Kai Zhang <kazhang2@cisco.com>
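The failure mode is an ordinary dictionary lookup miss. The sketch below
assumes a permmap that maps permission strings to HTTP methods and has no
entry for "w"; the exact contents of the real table in ceph_rest_api.py may
differ, and method_for is a hypothetical helper.

```python
# Assumed shape of the permission table; the real contents may differ.
permmap = {"r": "GET", "rw": "PUT"}


def method_for(perm: str) -> str:
    """Look up the HTTP method for a command's permission string."""
    return permmap[perm]


print(method_for("rw"))
try:
    method_for("w")
except KeyError as exc:
    # An unhandled KeyError like this is what surfaces as a 500 response.
    print("missing permission key:", exc)
```

Declaring the command with "rw" instead of "w" avoids the missing key
without changing the lookup table.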