Sage Weil [Thu, 1 Oct 2015 18:50:34 +0000 (14:50 -0400)]
osdc/Objecter: distinguish between multiple notify completions
We may send a notify to the cluster multiple times due to OSDMap
changes. In some cases, earlier notify attempts may complete with
an error, while later attempts succeed. We need to only pay
attention to the most recently sent notify's completion.
Do this by making note of the notify_id in the initial ACK (only
present when talking to newer OSDs). When we get a notify
completion, match it against our expected notify_id (if we have
one) or else discard it.
This is important because in some cases an early notify completion
may be an error while a later one succeeds.
Note that if we are talking to an old cluster we will simply not record a
notify_id and our behavior will be the same as before (we will trust any
notify completion we get).
Fixes: #13114
Signed-off-by: Sage Weil <sage@redhat.com>
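The matching rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the Objecter code: `PendingNotify`, `on_ack` and `on_completion` are hypothetical names, and the assumption is that each resend produces a fresh ACK that overwrites the expected notify_id.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the client-side state of one pending notify.
struct PendingNotify {
  // Empty when the OSD is too old to report a notify_id in its ACK.
  std::optional<uint64_t> expected_id;
  bool completed = false;
  int result = 0;

  // Called for the ACK of each (re)send; the latest ACK wins.
  void on_ack(std::optional<uint64_t> notify_id) {
    if (notify_id)
      expected_id = notify_id;
  }

  // Returns true if the completion was accepted, false if discarded.
  bool on_completion(uint64_t notify_id, int r) {
    if (expected_id && *expected_id != notify_id)
      return false;  // completion of an earlier attempt: ignore it
    completed = true;
    result = r;
    return true;
  }
};
```

With no recorded notify_id (old cluster) every completion is accepted, which matches the previous behavior.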
Sage Weil [Thu, 1 Oct 2015 18:50:00 +0000 (14:50 -0400)]
osd: reply to notify request with our unique notify_id
The OSD assigns a unique ID to each notify it queues for
processing. Include this in the reply to the notifier so that
they can match it up with the eventual completions they receive.
This is necessary to distinguish between multiple completions
they may receive if there is PG peering and the notify is resent.
In particular, an earlier notify may return an error when a later
attempt succeeds.
This is forwards and backwards compatible: new clients will make use of
this reply payload but older clients ignore it.
ReplicatedPG: clearing a whiteout should create the object
This was uncovered by 75321943729f1d5dfacb68645e3c5483740d66f8. Since
rbd_create() does a stat, the obc is cached as a whiteout, and the
subsequent create(EXCL) would fall through to return false from
maybe_create_new_object(). This would then skip adding a touch() to
the transaction.
ceph-objectstore-tool: delete ObjectStore::Sequencer after umount
An ObjectStore::Sequencer provided to an ObjectStore must not be
deallocated before umount. The ObjectStore::Sequencer may hold a pointer
to the instance with no reference counting. If a Context completes after
the ObjectStore::Sequencer is deleted, it could try to use it and fail.
Samuel Just [Fri, 25 Sep 2015 01:35:39 +0000 (18:35 -0700)]
OSDMap: fill in known encode_features where possible
Otherwise, if we get an incremental from hammer (struct_v = 6) we will
encode the full map as if it were before CEPH_FEATURE_PGID64, which
was actually pre-argonaut. Similarly, if struct_v >= 7, we know it
was encoded with CEPH_FEATURE_OSDMAP_ENC.
Fixes: #13234
Backport: hammer
Signed-off-by: Samuel Just <sjust@redhat.com>
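A rough sketch of the inference described above, for illustration only. The feature bit values here are placeholders rather than the real CEPH_FEATURE_* values, `infer_encode_features` is a hypothetical helper, and treating struct_v >= 6 as implying PGID64 is an assumption drawn from the message.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bits; the real CEPH_FEATURE_* values differ.
constexpr uint64_t FEATURE_PGID64 = 1ull << 0;
constexpr uint64_t FEATURE_OSDMAP_ENC = 1ull << 1;

// Hypothetical helper: derive the features we know an encoder had,
// based only on the struct_v found in the encoded map.
uint64_t infer_encode_features(int struct_v) {
  uint64_t features = 0;
  if (struct_v >= 6)
    features |= FEATURE_PGID64;      // assumption: v6 already implies PGID64
  if (struct_v >= 7)
    features |= FEATURE_OSDMAP_ENC;  // stated in the commit message
  return features;
}
```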
John Spray [Fri, 25 Sep 2015 12:02:56 +0000 (13:02 +0100)]
client: refactor quota check functions
Generalise the path traversal into check_quota_condition
and then call that from each of file_exceeded, bytes_exceeded,
bytes_approaching with the appropriate lambda function.
Motivated by fixing the path traversal and wanting to
only fix it in one place.
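The shape of this refactor can be sketched as below. This is a simplified illustration, not the client code: `QuotaNode` and its fields are hypothetical stand-ins for the inode and quota state, and the exact comparison used by each check is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for an inode with quota information.
struct QuotaNode {
  const QuotaNode* parent = nullptr;
  uint64_t max_files = 0;  // 0 means no limit
  uint64_t max_bytes = 0;  // 0 means no limit
  uint64_t files = 0;
  uint64_t bytes = 0;
};

// The single place that walks up the tree; callers supply the test.
bool check_quota_condition(const QuotaNode* n,
                           const std::function<bool(const QuotaNode&)>& test) {
  for (; n != nullptr; n = n->parent) {
    if (test(*n))
      return true;
  }
  return false;
}

bool file_exceeded(const QuotaNode* n) {
  return check_quota_condition(n, [](const QuotaNode& q) {
    return q.max_files && q.files >= q.max_files;
  });
}

bool bytes_exceeded(const QuotaNode* n, uint64_t new_bytes) {
  return check_quota_condition(n, [new_bytes](const QuotaNode& q) {
    return q.max_bytes && q.bytes + new_bytes > q.max_bytes;
  });
}
```

Any later fix to the traversal then only touches check_quota_condition.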
Ruifeng Yang [Fri, 25 Sep 2015 02:18:11 +0000 (10:18 +0800)]
Objecter: possible wild pointer (op) access in _op_submit_with_budget
See the comment in _op_submit: "after giving up session lock it can be freed at any time by response handler".
So op in _op_submit_with_budget may be a wild pointer once _op_submit has been called.
Fixes: #13208
Signed-off-by: Ruifeng Yang <yangruifeng.09209@h3c.com>
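The general pattern for avoiding this kind of bug is to copy whatever is needed out of op before handing it off. The sketch below only illustrates that pattern and is not the actual fix: `SubmitOp`, `submit_and_maybe_free` and `submit_with_budget` are hypothetical names, and the stand-in frees op immediately to model the worst case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an Objecter op.
struct SubmitOp {
  uint64_t tid;
  int budget;
};

// Stand-in for _op_submit: once it returns, op may already be freed.
// Here it is freed unconditionally to model the worst case.
void submit_and_maybe_free(SubmitOp* op) {
  delete op;
}

uint64_t submit_with_budget(SubmitOp* op) {
  const uint64_t tid = op->tid;  // copy before giving up ownership
  submit_and_maybe_free(op);
  return tid;                    // op is not dereferenced after the call
}
```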
Sage Weil [Wed, 23 Sep 2015 14:25:30 +0000 (10:25 -0400)]
osd: do full check in do_op
1. The current pool_last_map_marked_full tracking is buggy.
2. We need to recheck this each time we consider the op, not just when it
is received off the wire. Otherwise, we might get a message, queue it
for some reason, get a map indicating the cluster or pool is full, and
then requeue and process the op instead of discarding it.
3. For now, silently drop ops when the failsafe check fails. This will lead to
stalled client IO. This needs a more robust fix.
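Point 2 can be illustrated with a small sketch. It is a simplification under stated assumptions: `QueuedOp`, `MapState` and `check_full` are hypothetical names, and the exemption for ops carrying a force flag is modeled loosely on the FULL_FORCE flag mentioned in a related commit.

```cpp
#include <cassert>

// Hypothetical stand-ins for a queued op and the current map state.
struct QueuedOp {
  bool is_write;
  bool full_force;  // assumption: forced ops bypass the full check
};

struct MapState {
  bool full;
};

enum class Verdict { Process, Discard };

// Evaluated each time the op is considered, against the map held at
// that moment, rather than once when the message arrives.
Verdict check_full(const QueuedOp& op, const MapState& current) {
  if (current.full && op.is_write && !op.full_force)
    return Verdict::Discard;
  return Verdict::Process;
}
```

An op that passed the check on arrival is therefore still discarded if the map has become full by the time it is requeued.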
Sage Weil [Thu, 24 Sep 2015 23:02:21 +0000 (19:02 -0400)]
osdc/Objecter: set FULL_FORCE flag when honor_full is false
This currently only applies to the MDS. Eventually we can remove the
OSD MDS checks once we are confident all MDS instances are new enough
to set this flag.
Sage Weil [Thu, 24 Sep 2015 15:38:41 +0000 (11:38 -0400)]
osd/PG: compensate for sloppy hobject scrub bounds from hammer
Hammer is sloppy about the hobject_t's it uses for the scrub bounds in that
the pool isn't set. (Hammer FileStore doesn't care, but post-hammer is
much more careful about this sort of thing.)
Compensate by setting the pool on any scrub messages we receive.
Rare outside of vstart clusters, but if someone did
ever have one of these events in their journal and
try to update to latest ceph, they would end up
with bogus expire_pos on the reformatted events.
build/ops: make dist needs files with names > 99 characters
When running make distdir=ceph-9.0.3-1870-gfd861bb dist, a few files
have names longer than 99 characters and are discarded, which then causes
the resulting tarball to be incomplete:
tar: ceph-9.0.3-1870-gfd861bb/src/rocksdb/utilities/write_batch_with_index/write_batch_with_index_internal.cc: file name is too long (max 99); not dumped
tar: ceph-9.0.3-1870-gfd861bb/src/rocksdb/utilities/write_batch_with_index/write_batch_with_index_internal.h: file name is too long (max 99); not dumped
Use the tar-ustar format instead of the legacy v7
format (http://www.gnu.org/software/automake/manual/automake.html#Options). It
is unlikely machines with a C++11 compiler also have an antique tar
binary that would only support v7.
The only field actually relevant from this structure is .begin, which
duplicates the information in hit_set_start_stamp less well. The problem
is that the starting point of the currently open hit set is ephemeral
state which shouldn't go into the pg_info_t structure.
This also caused 13185 since pg_info_t.hit_set.current_info gets default
constructed with use_gmt = true regardless of the pool setting. This
becomes a problem in hit_set_persist since the oid is generated using
the pool setting, rather than the use_gmt value in current_info which
is placed into the history list. That discrepancy then causes a crash
in hit set trim. There would also be a related bug if the pool setting
is changed between when current_info is constructed and when it is
written out.
Since current_info isn't actually useful, I'm removing it so that we
don't later rely on invalid fields.
Fixes: 13185
Signed-off-by: Samuel Just <sjust@redhat.com>
Sage Weil [Wed, 23 Sep 2015 14:58:01 +0000 (10:58 -0400)]
mon/Elector: do a trivial write on every election cycle
Currently we already do a small write when the *first* election in
a round happens (to update the election epoch). If the backend
happens to fail while we are already in the midst of elections,
however, we may continue to call elections without verifying we
are still writeable.