Kefu Chai [Thu, 12 May 2016 12:28:11 +0000 (20:28 +0800)]
osd: reset session->osdmap if session is not waiting for a map anymore
we should release the osdmap reference once we are done with it,
otherwise we might need to wait very long to update that reference with
a newer osdmap ref. this appears to be an OSDMap leak: it is held by a
quiet OSD::Session forever.
the osdmap is not reset in OSD::session_notify_pg_create(), because its
only caller is wake_pg_waiters(), which will call
dispatch_session_waiting() later. and dispatch_session_waiting() will
check the session->osdmap, and will also reset the osdmap if
session->waiting_for_pg.empty().
Sage Weil [Thu, 10 Mar 2016 14:50:07 +0000 (09:50 -0500)]
log: do not repeat errors to stderr
If we get an error writing to the log, log it only once to stderr.
This avoids generating, say, 72 GB of ENOSPC errors in
teuthology.log when /var/log fills up.
Conflicts:
src/log/Log.cc (drop m_uid and m_gid which are not used in hammer;
order of do_stderr, do_syslog, do_fd conditional blocks is reversed in
hammer; drop irrelevant speed optimization code from 5bfe05aebfefdff9022f0eb990805758e0edb1dc)
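A minimal sketch of the "report only once" idea, not the actual src/log/Log.cc code. The class name, the boolean flag, and the FullDisk stand-in are all hypothetical.

```python
import io


class OnceToStderrLog:
    """Hypothetical logger that reports a failed log write to stderr only once."""

    def __init__(self, sink, stderr):
        self.sink = sink              # file-like object standing in for the log file
        self.stderr = stderr          # file-like object standing in for stderr
        self.error_reported = False   # assumption: a single flag is enough

    def write(self, line):
        try:
            self.sink.write(line)
        except OSError as e:
            # Report the first failure only; later failures are dropped silently.
            if not self.error_reported:
                self.stderr.write(f"problem writing to log: {e}\n")
                self.error_reported = True


class FullDisk:
    """Stand-in sink that always fails, as if /var/log were full."""

    def write(self, line):
        raise OSError(28, "No space left on device")


err = io.StringIO()
log = OnceToStderrLog(FullDisk(), err)
for _ in range(1000):
    log.write("some log line\n")
```

After the loop, err holds a single line, however many writes failed.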
shun-s [Tue, 28 Jun 2016 07:30:16 +0000 (15:30 +0800)]
ReplicatedBackend: delete one useless op->mark_started as there are two in ReplicatedBackend::sub_op_modify_impl
delete one mark_started event, as op->mark_started is called twice in ReplicatedBackend::sub_op_modify_impl
Fixes: http://tracker.ceph.com/issues/16572
Signed-off-by: shun-s <song.shun3@zte.com.cn>
Conflicts:
src/mon/Monitor.cc (the signature of Monitor::reply_command()
changed a little bit in master, so adapt the
commit to work with the old method)
Yehuda Sadeh [Thu, 5 May 2016 21:02:25 +0000 (14:02 -0700)]
rgw: handle stripe transition when flushing final pending_data_bl
Fixes: http://tracker.ceph.com/issues/15745
When complete_writing_data() is called, if pending_data_bl is not empty
we still need to handle stripe transition correctly. If pending_data_bl
has more data than we can allow in the current stripe, move to the next one.
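A rough sketch of the stripe transition when flushing leftover data. The function and its parameters are hypothetical and only model the bookkeeping, not the rgw processor itself.

```python
def flush_pending(pending, stripe_used, stripe_size):
    """Split pending bytes into (stripe_index, chunk) writes.

    stripe_index counts from the current stripe (0). stripe_used is how many
    bytes the current stripe already holds. All names are illustrative.
    """
    writes = []
    stripe = 0
    while pending:
        room = stripe_size - stripe_used
        if room == 0:
            # Current stripe is full: move on to the next one.
            stripe += 1
            stripe_used = 0
            room = stripe_size
        chunk, pending = pending[:room], pending[room:]
        writes.append((stripe, chunk))
        stripe_used += len(chunk)
    return writes
```

With 2 bytes already in a 4-byte stripe, flush_pending(b"abcdef", 2, 4) puts b"ab" in the current stripe and b"cdef" in the next.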
Yehuda Sadeh [Mon, 16 May 2016 21:35:12 +0000 (14:35 -0700)]
rgw: keep track of written_objs correctly
Fixes: http://tracker.ceph.com/issues/15886
Only add a rados object to the written_objs list if the write
was successful. Otherwise, if the write is canceled for some
reason, we'd remove an object that we didn't write to. This was
a problem when multiple writes went to
the same part. The second writer should fail the write, since
we do an exclusive write. However, we added the object's name
to the written_objs list anyway, which was a real problem when
the old processor was disposed (as it was clearing the objects).
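The race described above can be sketched with a toy model. ExclusiveStore and Processor are hypothetical stand-ins, not rgw classes.

```python
class ExclusiveStore:
    """Toy object store where a write fails if the object already exists."""

    def __init__(self):
        self.objects = {}

    def write_exclusive(self, name, data):
        if name in self.objects:
            return False
        self.objects[name] = data
        return True


class Processor:
    """Toy writer that cleans up the objects it wrote when disposed."""

    def __init__(self, store):
        self.store = store
        self.written_objs = []

    def write(self, name, data):
        ok = self.store.write_exclusive(name, data)
        if ok:
            # Record the object only after a successful write.
            self.written_objs.append(name)
        return ok

    def dispose(self):
        for name in self.written_objs:
            self.store.objects.pop(name, None)
        self.written_objs.clear()


store = ExclusiveStore()
first, second = Processor(store), Processor(store)
first_ok = first.write("part.1", b"a")
second_ok = second.write("part.1", b"b")   # fails: the object already exists
second.dispose()                           # must not remove the first writer's object
```

Because the second writer never recorded "part.1", disposing of it leaves the first writer's object in place.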
Kefu Chai [Mon, 9 May 2016 07:01:46 +0000 (15:01 +0800)]
osd: remove all stale osdmaps in handle_osd_map()
in a large cluster, the OSD is more likely to fail to trim
the cached osdmap in a timely manner. and sometimes, it is just unable
to keep up with the incoming osdmap if skip_maps is true, so the osdmap cache
can keep building up to over 250GB in size. in this change
* publish_superblock() before trimming the osdmaps, so other osdmap
consumers of OSDService.superblock won't access the osdmaps being
removed.
* trim all stale osdmaps in batches of conf->osd_target_transaction_size
if skip_maps is true. in my test, it happens when the osd only
receives the osdmap from monitor occasionally because the osd happens
to be chosen when monitor wants to share a new osdmap with a random
osd.
* always use dedicated transaction(s) for trimming osdmaps. so even in
the normal case where we are able to trim all stale osdmaps in a
single batch, a separate transaction is used. we could piggyback
the commits for removing maps, but we keep it this way for simplicity.
* use std::min() instead of MIN() for type safety
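The batching described in the list above can be sketched as follows. trim_batches is a hypothetical helper; batch_size stands in for conf->osd_target_transaction_size, and each inner list stands in for one dedicated transaction.

```python
def trim_batches(oldest, trim_to, batch_size):
    """Group the stale epochs [oldest, trim_to) into per-transaction batches."""
    batches = []
    epoch = oldest
    while epoch < trim_to:
        # Plain min(), in the spirit of preferring std::min() over MIN().
        end = min(epoch + batch_size, trim_to)
        batches.append(list(range(epoch, end)))
        epoch = end
    return batches
```

For example, trim_batches(10, 17, 3) yields three transactions covering epochs 10-12, 13-15, and 16.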
Kefu Chai [Wed, 16 Mar 2016 13:15:35 +0000 (21:15 +0800)]
osd: populate the trim_thru epoch using MOSDMap.oldest_map
instead of filling MOSDMap with the local oldest_map, we share
the maximum MOSDMap.oldest_map received so far with peers. That
way one OSD's failure to trim won't prevent it from sharing with
others that they are allowed to trim.
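A small sketch of the idea, with hypothetical names. Starting the maximum at the local value is an assumption made here so that something sensible is shared before any message arrives.

```python
class OldestMapTracker:
    """Tracks the largest oldest_map epoch seen in received map messages."""

    def __init__(self, local_oldest):
        # Assumption: fall back to the local oldest_map until a peer reports more.
        self.max_oldest_seen = local_oldest

    def note_received(self, oldest_map):
        self.max_oldest_seen = max(self.max_oldest_seen, oldest_map)

    def oldest_to_share(self):
        # Share the maximum seen so far rather than the local oldest_map.
        return self.max_oldest_seen


tracker = OldestMapTracker(local_oldest=100)   # this OSD is behind on trimming
tracker.note_received(250)
tracker.note_received(180)
```

Here the OSD still holds maps from epoch 100, yet it tells peers they may trim through 250.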
Brad Hubbard [Fri, 6 May 2016 05:05:42 +0000 (15:05 +1000)]
common: Add space between timestamp and "min lat:" in bench output
This change is taken from 069d95eaf49cadaa9a8fa1fa186455944a50ec7d
but I did not want to cherry-pick that patch since the rest of it
is purely cosmetic and would be unlikely to apply cleanly.
Samuel Just [Thu, 10 Mar 2016 23:19:15 +0000 (15:19 -0800)]
LFNIndex::lfn_translate: consider alt attr as well
If the file has an alt attr, there are two possible matching
ghobjects. We want to make sure we choose the right one for
the short name we have. If we don't, a split while there are
two objects linking to the same inode will result in one of
the links being orphaned in the source directory, resulting
in #14766.
Vitja Makarov [Wed, 17 Feb 2016 10:46:18 +0000 (13:46 +0300)]
hammer: rgw: S3: set EncodingType in ListBucketResult
Signed-off-by: Victor Makarov <vitja.makarov@gmail.com>
(cherry picked from commit d2e281d2beb0a49aae0fd939f9387cb2af2692c8)
X-Github-PR: 7712
Backport: hammer
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
Jianpeng Ma [Sun, 22 Mar 2015 14:07:24 +0000 (22:07 +0800)]
osd/Replicated: For CEPH_OSD_OP_WRITE, set data digest.
Add two cases which can add data digest for OP_WRITE:
a: offset = 0, and length > original size
b: offset = original size, and original has data_digest.
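The two cases can be illustrated with a running checksum. This sketch uses zlib.crc32 as a stand-in for the checksum the OSD uses, and digest_after_write is a hypothetical helper; the conditions follow the wording of cases a and b above.

```python
import zlib


def digest_after_write(old_size, old_digest, offset, data):
    """Return the whole-object digest after a write, or None if unknown.

    old_digest is None when the object has no data digest.
    """
    if offset == 0 and len(data) > old_size:
        # Case a: the write covers the whole object, so digest the new data.
        return zlib.crc32(data)
    if offset == old_size and old_digest is not None:
        # Case b: a pure append, so continue the checksum from the old digest.
        return zlib.crc32(data, old_digest)
    # Any other write: the digest cannot be derived from the write alone.
    return None


d1 = digest_after_write(0, None, 0, b"hello")
d2 = digest_after_write(5, d1, 5, b" world")
```

In this run d2 equals the checksum of b"hello world" computed in one pass, while a partial overwrite such as digest_after_write(5, d1, 2, b"xx") returns None.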
src/rgw/rgw_bucket.cc
1. Do not use the rgw_user structure and remove the tenant parameter, as described below
2. user_id is not used, so just remove the line
3. instead of system_obj_set_attr, use the set_attr method
Backport Change:
We do not use the rgw_user structure and remove the `tenant` parameter
because this feature was not introduced in hammer.
The rgw multi-tenant feature was introduced in pr#6784 (https://github.com/ceph/ceph/pull/6784)
This feature is supported from v10.0.2 onward.
Move initialization to the few tests that actually use it.
Fixes: http://tracker.ceph.com/issues/15225
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
(cherry picked from commit 1c2831a2c1277c69f9649200d74a75c04a4b0296)
Conflicts:
src/test/msgr/perf_msgr_client.cc
src/test/msgr/perf_msgr_server.cc
src/test/perf_local.cc
These three files were not introduced in hammer, so they are simply removed.
Robin H. Johnson [Thu, 31 Mar 2016 06:24:40 +0000 (06:24 +0000)]
rgw: Multipart ListPartsResult ETag quotes
ListPartsResult output has always missed quotes on the ETag since it was
first committed.
Fixes: #15334
Backports: hammer, infernalis
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
(cherry picked from commit a58b774e72cc1613d62e10b25322d6d15e9d2899)
When the thrasher is in action together with a validator (lockdep or
valgrind), a single test may hang for more than 360 seconds. Increase to
1200: it does not matter if the value is large, only that it prevents
the test from hanging forever.
Vicente Cheng [Tue, 9 Feb 2016 20:03:24 +0000 (12:03 -0800)]
rgw: user quota may not adjust on bucket removal
Description:
If the user/admin removes a bucket using --force/--purge-objects options with s3cmd/radosgw-admin respectively, the user stats will continue to reflect the deleted objects for quota purposes, and there seems to be no way to reset them. User stats need to be sync'ed prior to bucket removal.
Solution:
Sync user stats before removing a bucket.
src/rgw/rgw_op.cc
reorder the check sequence and replace some op_ret with ret
Backport Change:
We remove the `tenant` parameter because this feature was not introduced in hammer.
The rgw multi-tenant feature was introduced in pr#6784 (https://github.com/ceph/ceph/pull/6784)
This feature is supported from v10.0.2 onward.
ceph.spec.in: disable lttng and babeltrace explicitly
before this change, we did not package tracepoint probe shared libraries
on rhel7. but the "configure" script enables them if lttng is detected, and
rpm complains at seeing installed but not packaged files. as EPEL-7 now
includes lttng-ust-devel and libbabeltrace-devel, we'd better
BuildRequire them, and build with them unless disabled otherwise. so in
this change
* make "lttng" an rpm build option enabled by default
* BuildRequire lttng-ust-devel and libbabeltrace-devel if the
"lttng" option is enabled
* --without-lttng --without-babeltrace if the "lttng" option is disabled
hammer: monclient: avoid key renew storm on clock skew
Refreshing rotating keys too often is a symptom of a clock skew, try to
detect it and don't cause extra problems:
* MonClient::_check_auth_rotating:
- detect and report premature keys expiration due to a time skew
- rate limit refreshing the keys to avoid excessive RAM and CPU usage
(both by OSD in question and monitors which have to process a lot
of auth messages)
* MonClient::wait_auth_rotating: wait for valid (not expired) keys
* OSD::init(): bail out after 10 attempts to obtain the rotating keys
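The rate limiting can be sketched roughly like this. The class, its return values, and the 30-second minimum interval are hypothetical and are not taken from MonClient.

```python
class RotatingKeyRefresher:
    """Toy model: request new rotating keys at most once per min_interval."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_request = None

    def maybe_refresh(self, now, keys_expire_at):
        if keys_expire_at > now:
            return "valid"            # keys are not expired, nothing to do
        if (self.last_request is not None
                and now - self.last_request < self.min_interval):
            # Keys look expired again right after a refresh: a likely sign of
            # clock skew, so do not flood the monitors with auth requests.
            return "rate-limited"
        self.last_request = now
        return "requested"


refresher = RotatingKeyRefresher(min_interval=30)
results = [
    refresher.maybe_refresh(now=100, keys_expire_at=200),
    refresher.maybe_refresh(now=100, keys_expire_at=50),
    refresher.maybe_refresh(now=110, keys_expire_at=50),
    refresher.maybe_refresh(now=131, keys_expire_at=50),
]
```

Only the second and fourth calls send a request; the third falls inside the minimum interval and is suppressed.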
gmt_hitset is enabled by default in the ctor of pg_pool_t. this
is intentional, because we want to remove this setting and make
gmt_hitset=true the default in the future. but this forces us to
disable it explicitly when preparing a new pool if any OSD does
not support gmt hitset.
Kefu Chai [Fri, 5 Jun 2015 13:06:48 +0000 (21:06 +0800)]
osd: use GMT time for the object name of hitsets
* bump the encoding version of pg_hit_set_info_t to 2, so we can
tell if the corresponding hit_set is named using localtime or
GMT
* bump the encoding version of pg_pool_t to 20, so we can know
if a pool is using GMT to name the hit_set archive or not. and
we can tell if the current cluster allows OSDs that do not support GMT
mode or not.
* add an option named `osd_pool_use_gmt_hitset`. if enabled,
the cluster will try to use GMT mode when creating a new pool
if all the up OSDs support GMT mode. if any of the
pools in the cluster is using GMT mode, then only OSDs
supporting GMT mode are allowed to join the cluster.
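Why the time base matters for naming can be shown with a small sketch. hitset_stamp and its timestamp format are hypothetical; the real object names are built in the OSD code.

```python
import time


def hitset_stamp(epoch_seconds, use_gmt):
    """Format a timestamp for use in a hit set object name."""
    if use_gmt:
        tm = time.gmtime(epoch_seconds)      # same result on every host
    else:
        tm = time.localtime(epoch_seconds)   # depends on the host's timezone
    return time.strftime("%Y-%m-%d %H:%M:%S", tm)


gmt_name = hitset_stamp(0, use_gmt=True)
local_name = hitset_stamp(0, use_gmt=False)
```

gmt_name is identical on every host, while local_name varies with the timezone of the machine that computed it, so two OSDs in different timezones could derive different names for the same hit set.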
Conflicts:
src/include/ceph_features.h
src/osd/ReplicatedPG.cc
src/osd/osd_types.cc
src/osd/osd_types.h
fill pg_pool_t with default settings in master branch.
test/bufferlist: do not expect !is_page_aligned() after unaligned rebuild
if the size of a bufferlist is page aligned we allocate a page aligned
memory chunk for it when rebuild() is called. otherwise we just call
the plain new() to allocate a new memory chunk for holding the contiguous
buffer. but we should not expect that the `new` allocator always returns
unaligned memory chunks. instead, it *could* return a page aligned
memory chunk as long as the allocator feels appropriate. so, the
`EXPECT_FALSE(bl.is_page_aligned())` after the `rebuild()` call is
removed.
Sage Weil [Tue, 6 Oct 2015 18:35:35 +0000 (14:35 -0400)]
osd/PG: fix generate_past_intervals
We may be only calculating older past intervals and have a valid
history.same_interval_since value, in which case the local
same_interval_since value will end at the newest old interval we had to
generate.
mon: Monitor: get rid of weighted clock skew reports
By weighting the reports we were making it really hard to get rid of a
clock skew warning once the cause had been fixed.
Instead, as soon as we get a clean bill of health, let's run a new round
as soon as possible and ascertain whether that was a transient fix or
for realsies. That should be better than the alternative of waiting for
an hour or something (for a large enough skew) for the warning to go
away - and with it, the admin's sanity ("WHAT AM I DOING WRONG???").
When in the presence of a clock skew, adjust the checking interval
according to how many rounds have gone by since the last clean check.
If a skew is detected, instead of waiting an additional 300 seconds we
will perform the check more frequently, gradually backing off the
frequency if the skew is still in place (up to a maximum of
'mon_timecheck_interval', default: 300s). This will help with transient
skews.
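One way to picture the backoff is the sketch below. The linear growth and the 30-second base interval are assumptions made for illustration; only the 300-second cap ('mon_timecheck_interval') comes from the text above.

```python
def next_timecheck_delay(rounds_since_clean, skew_interval, max_interval):
    """Seconds to wait before the next timecheck round.

    rounds_since_clean is 0 when the last round found no skew.
    """
    if rounds_since_clean <= 0:
        return max_interval
    # Check sooner right after a skew shows up, then back off up to the cap.
    return min(skew_interval * rounds_since_clean, max_interval)


delays = [next_timecheck_delay(r, 30, 300) for r in (0, 1, 2, 10, 20)]
```

With these numbers the delays are 300, 30, 60, 300, and 300 seconds: a fresh skew is rechecked quickly, and a persistent one settles back to the normal interval.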
Conflicts:
src/common/config_opts.h
Merge the change line.
src/mon/Monitor.h
handle_timecheck_leader(MonOpRequestRef op) was replaced with handle_timecheck_leader(MTimeCheck *m)
also for handle_timecheck_peon and handle_timecheck.
Dan Mick [Thu, 26 Nov 2015 03:20:51 +0000 (19:20 -0800)]
test/librados/test.cc: clean up EC pools' crush rules too
SetUp was adding an erasure-coded pool, which automatically adds
a new crush rule named after the pool, but only removing the
pool. Remove the crush rule as well.