osd_types, PGLog: encode missing based on features
Store whether the missing set should contain deletes, so that
persisted versions can be rebuilt if needed. Make missing_item
versioned, since it's persisted by the pg_log as an individual omap
value.
osd_types, Objecter: make recovery_deletes feature create a new interval
This is needed to create a single place to regenerate the missing set
- at the start of a new interval where support for recovery deletes
changed.
The missing set is otherwise not cleared, so it would need to be
rebuilt in arbitrary places if e.g. an osd not supporting it went down
and restarted with support, or if we used a feature flag command to
trigger rebuilding the missing set.
OSDMap, OSDMonitor: automatically set recovery deletes for luminous
Once the required osd release is luminous, all osds must support
recovery deletes, so set the flag then. This avoids an extra manual
step in luminous upgrades.
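A minimal sketch of the rule described above, not the OSDMonitor code itself. The enum, the flag constant and its value, and the function name are all hypothetical and chosen only for illustration; the release numbers follow Ceph's major versions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the real OSDMap uses its own release and flag constants.
enum class Release { jewel = 10, kraken = 11, luminous = 12 };
constexpr uint32_t FLAG_RECOVERY_DELETES = 1u << 0;  // illustrative value only

// Set the recovery-deletes flag once the required release reaches luminous.
// Flags that are already set are never cleared.
constexpr uint32_t apply_release_flags(Release required, uint32_t flags) {
  return required >= Release::luminous ? (flags | FLAG_RECOVERY_DELETES) : flags;
}

inline void example_release_flags() {
  assert(apply_release_flags(Release::luminous, 0) == FLAG_RECOVERY_DELETES);
  assert(apply_release_flags(Release::kraken, 0) == 0);
}
```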
Josh Durgin [Fri, 30 Jun 2017 00:12:39 +0000 (20:12 -0400)]
include/ceph_features.h: add feature bit for handling deletes during recovery
The BLKIN feature bit was actually unused - it was a remnant from
earlier versions of the blkin work, but all the encoding is handled by
struct-level versioning in the version that merged.
Use bit 60 (unused in any prior version) so that recovery deletes
could potentially be backported.
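As a rough illustration of what occupying bit 60 means, the sketch below builds a 64-bit mask and tests a peer's feature set against it. The names are hypothetical and are not the macros defined in include/ceph_features.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not the macros in ceph_features.h.
constexpr uint64_t feature_mask(unsigned bit) { return uint64_t{1} << bit; }
constexpr uint64_t FEATURE_RECOVERY_DELETES = feature_mask(60);

// True when every bit in `wanted` is present in the peer's feature set.
constexpr bool has_features(uint64_t peer_features, uint64_t wanted) {
  return (peer_features & wanted) == wanted;
}

inline void example_feature_bits() {
  const uint64_t old_peer = 0x3;                              // lacks bit 60
  const uint64_t new_peer = old_peer | FEATURE_RECOVERY_DELETES;
  assert(!has_features(old_peer, FEATURE_RECOVERY_DELETES));
  assert(has_features(new_peer, FEATURE_RECOVERY_DELETES));
}
```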
Josh Durgin [Tue, 27 Jun 2017 01:45:15 +0000 (21:45 -0400)]
osd/PGLog: reset complete_to when appending lost_delete entries
Since lost_deletes queue recovery directly, and don't go through
activate_not_complete(), our complete_to iterator may still point at
log.end() (a list iterator pointing to .end() will still point to
.end() after a push_back()). Reset it to point before these new
lost_delete entries. This is needed now that lost_deletes are
performed during recovery, instead of inline when merging logs.
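The iterator behaviour mentioned above can be reproduced with a plain std::list. This is a standalone sketch, not the PGLog code: the list of ints and the helper first_appended() are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Hypothetical helper: iterator to the first of the `added` entries that
// were just appended to `log`. Assumes added <= log.size().
template <typename T>
typename std::list<T>::iterator first_appended(std::list<T>& log, std::size_t added) {
  return std::prev(log.end(), static_cast<std::ptrdiff_t>(added));
}

inline void example_complete_to() {
  std::list<int> log = {1, 2, 3};
  auto complete_to = log.end();      // everything in the log is complete
  log.push_back(4);                  // append a new entry
  assert(complete_to == log.end());  // still end(): the new entry is skipped
  complete_to = first_appended(log, 1);
  assert(*complete_to == 4);         // now points at the appended entry
}
```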
Josh Durgin [Wed, 21 Jun 2017 00:29:04 +0000 (20:29 -0400)]
osd/PrimaryLogPG: check whether clones missing from the cache are recovering
This appears now that deletes are not processed inline from the PG log
- a clone that is missing only on a peer (due to being deleted) would
not stop rollback from promoting the clone, resulting in hitting an
assert on the replica when the promotion tried to write to the missing
object on the replica.
This only affects cache tiering due to the dependence on the
MAP_SNAP_CLONE flag in find_object_context() - missing_oid was not being checked for being
recovered, unlike the target oid for the op (in do_op()).
Josh Durgin [Mon, 26 Jun 2017 23:00:18 +0000 (19:00 -0400)]
osd/PrimaryLogPG,PGBackend: handle deletes during recovery
Deletes are the same for EC and replicated pools, so add logic for
handling MOSDPGRecoveryDelete[Reply] to the base PGBackend
class.
Within PrimaryLogPG, add parallel paths for starting deletes,
recover_missing() and prep_object_replica_deletes(), and update the
local and global recovery callbacks to deal with lacking an
ObjectContext after a delete has been performed.
Josh Durgin [Mon, 26 Jun 2017 22:14:02 +0000 (18:14 -0400)]
osd/PG: handle deletes in MissingLoc
There's no source needed for deleting an object, so don't keep track
of this. Update is_readable_with_acting/is_unfound, and add an
is_deleted() method to be used later.
Josh Durgin [Mon, 26 Jun 2017 22:09:27 +0000 (18:09 -0400)]
osd: add a 'delete' flag to missing items and related functions
This will track deletes that were in the pg log and still need to be
performed during recovery. Note that with these deleted objects we may
not have an accurate 'have' version, since the object may have already
been deleted locally, so tolerate this when examining divergent entries.
tests: ceph-disk: use communicate() instead of wait() for output
This avoids a possible deadlock. Quoting the documentation of Popen.wait():
> This will deadlock when using stdout=PIPE and/or stderr=PIPE and the
> child process generates enough output to a pipe such that it blocks
> waiting for the OS pipe buffer to accept more data. Use communicate() to
> avoid that.
Also print the stdout and stderr using LOG.warn() if the command
fails.
Add a set of new tests for the case when public_addr and public_bind_addr
are different for a mon. In order to test this properly I had to employ
port forwarding with socat. This helps simulate what would happen in an
environment like Kubernetes. socat is now a build dependency.
Also, moved jq_success to ceph-helpers.sh and refactored run_mon to enable
creating the mons without creating the rbd pool immediately.
To support running in dynamic environments (like Kubernetes) the mon needs
to be able to advertise an IP address that is different from the IP address
that it listens on locally.
Added a new config option "public_bind_addr" which if set becomes the address
that the mon will bind to locally. If empty (the default) the public_addr
will be used to bind locally.
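The fallback rule can be sketched as a small helper. Plain strings stand in for Ceph's address types here, and choose_bind_addr() is a hypothetical name, not a function from the mon.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: pick the address the mon binds to locally.
// An empty public_bind_addr (the default) falls back to public_addr.
inline std::string choose_bind_addr(const std::string& public_addr,
                                    const std::string& public_bind_addr) {
  return public_bind_addr.empty() ? public_addr : public_bind_addr;
}

inline void example_bind_addr() {
  // Default: bind to the advertised address.
  assert(choose_bind_addr("10.0.0.5:6789", "") == "10.0.0.5:6789");
  // Override: bind locally to a different address than the advertised one.
  assert(choose_bind_addr("10.0.0.5:6789", "127.0.0.1:6789") == "127.0.0.1:6789");
}
```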
Added a new function, set_addr, on Messenger, which is called by ceph-mon to
set the advertised address after doing the bind.
Also relaxed the "wrong node!" errors in AsyncMessenger and SimpleMessenger, as
it's now valid to talk to a peer whose peer_addr_of_me is different from what
we expect.
core: "Stringify needs access to << before reference" src/include/stringify.h
Clang complains:
In file included from /home/jenkins/workspace/ceph-master/src/mon/HealthMonitor.cc:21:
/home/jenkins/workspace/ceph-master/src/include/stringify.h:15:6: error: call to function 'operator<<' that is neither visible in the template definition nor found by argument-dependent lookup
ss << a;
^
/home/jenkins/workspace/ceph-master/src/mon/HealthMonitor.cc:129:32: note: in instantiation of function template specialization 'stringify<std::__1::set<std::__1::basic_string<char>, std::__1::less<std::__1::basic_string<char> >, std::__1::allocator<std::__1::basic_string<char> > > >' requested here
boost::regex("%names%"), stringify(names[p.first]));
^
/home/jenkins/workspace/ceph-master/src/include/types.h:160:17: note: 'operator<<' should be declared prior to the call site
inline ostream& operator<<(ostream& out, const set<A, Comp, Alloc>& iset) {
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
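The lookup rule behind this error can be shown in isolation. In the sketch below, the operator<< overload for std::set is declared before the template that calls it; because std::set lives in namespace std, argument-dependent lookup would not find a global overload declared later. The names and the output format are assumptions made for illustration and do not reproduce src/include/stringify.h or types.h.

```cpp
#include <cassert>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Declared before the template below, so unqualified lookup at the
// template definition can see it.
template <typename T>
std::ostream& operator<<(std::ostream& out, const std::set<T>& s) {
  bool first = true;
  for (const auto& v : s) {
    if (!first) out << ",";
    out << v;
    first = false;
  }
  return out;
}

// Hypothetical stand-in for stringify().
template <typename T>
std::string stringify_sketch(const T& a) {
  std::ostringstream ss;
  ss << a;
  return ss.str();
}

inline void example_stringify() {
  const std::set<std::string> names = {"b", "a"};
  assert(stringify_sketch(names) == "a,b");  // std::set iterates in sorted order
}
```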
Jos Collin [Fri, 14 Jul 2017 02:13:51 +0000 (07:43 +0530)]
crush: silence warning from -Woverflow
The following warning appears during build:
ceph/src/crush/CrushWrapper.cc: In member function ‘int32_t CrushWrapper::_alloc_class_id() const’:
ceph/src/crush/CrushWrapper.cc:1322:56: warning: integer overflow in expression [-Woverflow]
uint32_t upperlimit = numeric_limits<int32_t>::max() + 1;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
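The warning arises because the addition is evaluated in int32_t before the result is converted to uint32_t. One way to avoid it, shown here as a sketch and not necessarily the change made in this commit, is to convert to the unsigned type before adding. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Convert first, then add: the arithmetic happens in uint32_t, where
// 2^31 is representable, so nothing overflows.
constexpr uint32_t class_id_upper_limit() {
  return static_cast<uint32_t>(std::numeric_limits<int32_t>::max()) + 1u;
}

static_assert(class_id_upper_limit() == 2147483648u, "2^31 fits in uint32_t");

inline void example_upper_limit() {
  assert(class_id_upper_limit() == 2147483648u);
}
```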
Sage Weil [Tue, 11 Jul 2017 01:21:59 +0000 (21:21 -0400)]
osd/OSDMap: remove assumption about type ids
The code is assuming type==1 is in use, but it might not be. (It is
usually 'chassis' by default, which is rarely used; 'host' is usually
type 2.) Remove the type check entirely and identify leaves by a child
>= 0.
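The leaf test relies on the CRUSH convention that device ids are non-negative while bucket ids are negative. A minimal sketch, using a hypothetical BucketSketch struct in place of the real CRUSH bucket type:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a CRUSH bucket: its own id plus child item ids.
struct BucketSketch {
  int id;                  // bucket ids are negative
  std::vector<int> items;  // children: devices (>= 0) or other buckets (< 0)
};

// A child is a leaf (a device) when its id is non-negative.
inline bool is_leaf_item(int item) { return item >= 0; }

// True when the bucket directly contains at least one device,
// regardless of which type id the bucket has.
inline bool holds_devices(const BucketSketch& b) {
  return std::any_of(b.items.begin(), b.items.end(), is_leaf_item);
}

inline void example_leaves() {
  const BucketSketch host{-2, {0, 1, 2}};  // children are devices
  const BucketSketch root{-1, {-2, -3}};   // children are buckets
  assert(holds_devices(host));
  assert(!holds_devices(root));
}
```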