mon: OSDMonitor: update the osdmap's latest_full with the new full version
We used to do this on encode_full(), but since [1] we no longer rely on
PaxosService to manage the full maps for us. And we forgot to write down
the latest_full version to the store, leaving it in a truly outdated state.
Samuel Just [Fri, 19 Jul 2013 22:56:52 +0000 (15:56 -0700)]
OSD::RemoveWQ: do not apply_transaction while blocking _try_resurrect_pg
Some callbacks take the osd lock, so we need to avoid blocking an
osd lock holding thread while waiting on a filestore callback.
Instead, just queue the transaction, and allow _try_resurrect_pg
to cancel us while we are waiting for the transaction to go through
(CLEARING_WAITING).
Fixes: #5672 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Danny Al-Gaaf [Sat, 20 Jul 2013 17:36:32 +0000 (19:36 +0200)]
test_cls_version.cc: don't free object twice, free the right one
Object 'librados::ObjectWriteOperation *op' is freed twice in the TEST
test_version_inc_read. Free instead 'librados::ObjectReadOperation *rop'
Related cppcheck warning:
[src/test/cls_version/test_cls_version.cc:79]: (error) Memory
pointed to by 'op' is freed twice.
This should also fix:
CID 1049247 (#1 of 1): Use after free (USE_AFTER_FREE)
deref_arg: Calling "librados::ObjectWriteOperation::~ObjectWriteOperation()"
dereferences freed pointer "op". (The dereference happens because this is
a virtual function call.)
CID 1049218 (#4 of 4): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable "rop" going out of scope leaks the storage it
points to.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Sat, 20 Jul 2013 17:00:50 +0000 (19:00 +0200)]
rgw: change RGWOp::name() to return string instead of char*
Return 'const string' instead of 'const char *' from RGWOp::name() to
avoid the usage of std::string:c_str() to return 'const char *' in
some cases in rgw_rest_replica_log.h.
Returning result of c_str() from a function is dangerous since the
result gets (may) invalid after the related string object gets
destroyed or out of scope (which is the case with return). So you
may end up with garbage in this case.
Related warning from cppcheck:
[src/rgw/rgw_rest_replica_log.h:39]: (error) Dangerous usage of
c_str(). The value returned by c_str() is invalid after this call.
[src/rgw/rgw_rest_replica_log.h:59]: (error) Dangerous usage of
c_str(). The value returned by c_str() is invalid after this call.
[src/rgw/rgw_rest_replica_log.h:79]: (error) Dangerous usage of
c_str(). The value returned by c_str() is invalid after this call
This should also fix:
CID 1049250 (#1 of 1): Wrapper object use after free (WRAPPER_ESCAPE)
escape: The internal representation of "s" escapes, but is destroyed
when it exits scope.
CID 1049251 (#1 of 1): Wrapper object use after free (WRAPPER_ESCAPE)
escape: The internal representation of "s" escapes, but is destroyed
when it exits scope.
CID 1049252 (#1 of 1): Wrapper object use after free (WRAPPER_ESCAPE)
escape: The internal representation of "s" escapes, but is destroyed
when it exits scope.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Sage Weil [Fri, 19 Jul 2013 23:36:01 +0000 (16:36 -0700)]
mon: OSDMonitor: only thrash and propose if we are the leader
'thrash_map' is only set if we are the leader, so we would thrash and
propose the pending value if we are the leader. However, we should keep
the 'is_leader()' check not only for clarity's sake (an unfamiliar reader
may cry OMGBUG, prompting to a patch much like this), but also because
we may lose a subsequent election and become a peon instead, while still
holding a 'thrash_map' value > 0 -- and we really don't want to propose
while being a peon.
Sage Weil [Fri, 19 Jul 2013 23:35:02 +0000 (16:35 -0700)]
mon/OSDMonitor: do not wait for readable in send_latest()
send_latest() checks for readable and, if untrue, will wait before sending
out the latest OSDMap. This is completely unnecessary; I think it is a
hold-over from when we have independent paxos states. An audit of all
callers confirms that everyone would be happy with whatever is committed,
even if we are in the process of committing an even newer version.
Effectively, everyone waits *above* this layer in the usual PaxosService
traps for whether we are readable or not. This means that waiting_for_map
and send_to_waiting() go away entirely, which is nice.
This addresses, among other things: send_to_waiting() is called from
update_from_paxos(), which can be called when we are not readable due to
the paxos commit/finish timing changes in f1ce8d7c955a24 and c711203c0d4b. If no subsequent update happens, those waiters never get
their maps.
Instead, we send them immediately--we know they are committed and old
history is as good as future history.
Fixes: #5643 Signed-off-by: Sage Weil <sage@inktank.com>
On peons, on_active() is only called when we *first* become active after an
election. Only on the leader is it called after each commit/update. This
makes this change cause other problems (broken subscriptions on peons, in
particular). We possibly should fix that, but there is also a simpler fix
for the original problem we were trying to solve.
The logic was a bit broken. Basically, we want to make sure
that region names are the same. However, if region name is not
set then we need to check whether it's the master region. This
can happen in upgrade cases where originally we didn't have
a region name set.
Multiple fixes:
- sync master, secondary entry point ver on creation
- use correct entry point version when removing entry point
- check correct version on bucket removal
was never initialized correctly anyway. It was only supposed to
be used for buckets, but it was never initialized in that case.
Using s->bucket_info.objv_tracker instead.
rgw: forward delete bucket request to master after removal
We can only forward the bucket removal to the master if it was
successfully removed locally.
The master region has no knowledge about whether the
bucket can be removed or not, e.g., there are still objects in the
bucket. If we send it to the master first, then it'll happily remove it
even though it might fail in the end.
We had a problem with bucket recreation, where we identified
that bucket has already existed, but missed the fact that it's
the same bucket, so removal of the bucket index was wrong.
Sage Weil [Thu, 18 Jul 2013 21:50:32 +0000 (14:50 -0700)]
client: mark_down by con
We have the con handy; use it. This avoids generate a spurious RESET
event, which we do not need or do anything useful with. Note that in this
case we are not attaching anything to the Connection priv field.
Sage Weil [Thu, 18 Jul 2013 21:44:17 +0000 (14:44 -0700)]
mon: break con <-> session ref cycle in mon even if shutting down
If we get a reset during shutdown, we should still break the cycle to avoid
tripping the valgrind leak detection. Note that we are touching no
internal Monitor state here and the locking has not changed.
Sage Weil [Wed, 17 Jul 2013 05:43:26 +0000 (22:43 -0700)]
msgr: generate reset event on mark_down to addr (not con)
If the caller is marking down an addr, they presumably don't have the
Connection* handy, so we should generate a reset event to help them
clean up con <-> session ref cycles.
Sage Weil [Fri, 19 Jul 2013 17:37:16 +0000 (10:37 -0700)]
mon/PGMap: avoid negative pg stats when calculating rates
We periodically see strange values come out of the estimated cluster
throughput and recovery rates. Pretty sure this is cause by feeding
negative values into the rate arithmetic and then giving the si_t
helpers mangled (sign-extended + bit shifted) values.
Sage Weil [Fri, 19 Jul 2013 17:39:17 +0000 (10:39 -0700)]
mon/PGMap: use signed values for calculated rates
si_t (and friends) does not handle signed values, but at least we can
give the Formatters unmangled values. This shouldn't happen (tm), but
if it does this will make things a bit less confusing and makes the code
a bit less fragile.
Sage Weil [Thu, 18 Jul 2013 04:52:50 +0000 (21:52 -0700)]
osd: add floor() method to pg/osd stat structs
We often want to maintain a nonnegative value. We generalize
this to floors other than zero only because it makes the function
call make intuitive sense; I don't think it is at all useful.
Sage Weil [Thu, 18 Jul 2013 23:24:00 +0000 (16:24 -0700)]
msg/Pipe: work around incorrect features reported by earlier versions
If we see a peer reporting features ~0ull, we know they are deluded in a
particular way and should infer what features they *actually* have. Do
this right when the features come over the wire to catch all users.
Fixes: #5655 Signed-off-by: Samuel Just <sam.just@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 18 Jul 2013 21:50:32 +0000 (14:50 -0700)]
client: mark_down by con
We have the con handy; use it. This avoids generate a spurious RESET
event, which we do not need or do anything useful with. Note that in this
case we are not attaching anything to the Connection priv field.
Sage Weil [Thu, 18 Jul 2013 21:44:17 +0000 (14:44 -0700)]
mon: break con <-> session ref cycle in mon even if shutting down
If we get a reset during shutdown, we should still break the cycle to avoid
tripping the valgrind leak detection. Note that we are touching no
internal Monitor state here and the locking has not changed.
Sage Weil [Wed, 17 Jul 2013 05:43:26 +0000 (22:43 -0700)]
msgr: generate reset event on mark_down to addr (not con)
If the caller is marking down an addr, they presumably don't have the
Connection* handy, so we should generate a reset event to help them
clean up con <-> session ref cycles.
Sage Weil [Thu, 18 Jul 2013 21:06:41 +0000 (14:06 -0700)]
cls_lock: fix duration test
It's possible for us to just be really slow when getting the reply to the
first op or doing the second op, resulting in a successful lock. If we
do get a success, assert that at least that amount of time has passed to
avoid any false positives.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Samuel Just [Thu, 18 Jul 2013 17:12:17 +0000 (10:12 -0700)]
FileStore: add global replay guard for split, collection_rename
In the event of a split or collection rename, we need to ensure that
we don't replay any operations on objects within those collections
prior to that point. Thus, we mark a global replay guard on the
collection after doing a syncfs and make sure to check that in
_check_replay_guard() for all object operations.
Fixes: #5154 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 17 Jul 2013 04:33:28 +0000 (21:33 -0700)]
rgw: drop unused assignment
rgw/rgw_rados.cc: In member function 'virtual int RGWPutObjProcessor_Atomic::handle_data(ceph::bufferlist&, off_t, void**)':
rgw/rgw_rados.cc:648:5: warning: parameter 'ofs' set but not used [-Wunused-but-set-parameter]
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Sage Weil [Wed, 17 Jul 2013 22:49:16 +0000 (15:49 -0700)]
mon: make 'health' warn about slow requests
Currently we see slow request warnings go by in the cluster log, but they
are not reflected by 'ceph health'. Use the new op queue histograms to
raise a flag there as well.
For example:
HEALTH_WARN 59 requests are blocked > 32 sec; 2 osds have slow requests
21 ops are blocked > 65.536 sec
38 ops are blocked > 32.768 sec
16 ops are blocked > 65.536 sec on osd.1
23 ops are blocked > 32.768 sec on osd.1
5 ops are blocked > 65.536 sec on osd.2
15 ops are blocked > 32.768 sec on osd.2
2 osds have slow requests
Fixes: #5505 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 17 Jul 2013 21:21:40 +0000 (14:21 -0700)]
osd: include op queue age histogram in osd_stat_t
This includes a simple power-of-2 histogram of op ages in the op queue
inside osd_stat_t. This can be used for a coarse view of overall cluster
performance (it will get summed by the mon), to identify specific outlier
osds who have a higher latency than the others, or to identify stuck ops.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 17 Jul 2013 22:04:10 +0000 (15:04 -0700)]
PG: start flush on primary only after we process the master log
Once we start serving reads, stray objects must have already
been removed. Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set. This way a replica won't serve a replica read until
the store is consistent.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>