By calling the reweight_by_utilization() method, we aim for a more even
distribution of utilization among all osds. To achieve this, we decrease the
weights of osds that are currently overloaded and, where possible, increase
the weights of osds that are currently underloaded.
However, we cannot do this all at once, in order to avoid massive pg migrations
between osds. Thus we introduce a max_osds limit to smooth the progress.
The problem here is that we have sorted the utilization of all osds in
descending order and we always try to decrease the weights of the most
overloaded osds first, since they are the most likely to encounter a
nearfull/full transition soon, but we do not increase the weights of the most
underloaded (least utilized, by contrast) osds at the same time, which I think
is not quite reasonable.
Actually, the best thing would probably be to iterate over the low and high osds
in parallel, and do the ones that are furthest from the average first.
Resolved by picking the lambda implementation.
NOTE: Because hammer does not support C++11, the lambda functionality from the
current master has been moved into the "Sorter" function object.
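As a rough illustration of the approach, here is a minimal, self-contained
sketch of a hammer-style "Sorter" function object (no C++11 lambdas) that
orders OSDs by their distance from the average utilization, so the furthest
outliers, whether overloaded or underloaded, are adjusted first. osd_util_t
and max_osds are simplified stand-ins, not the actual reweight code:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical per-OSD record: id plus utilization ratio (0.0 - 1.0).
struct osd_util_t {
  int id;
  double util;
};

// Function object usable without C++11 lambdas: sorts OSDs by how far their
// utilization deviates from the cluster average, furthest first.
struct Sorter {
  double average;
  explicit Sorter(double avg) : average(avg) {}
  bool operator()(const osd_util_t &a, const osd_util_t &b) const {
    return std::fabs(a.util - average) > std::fabs(b.util - average);
  }
};

int main() {
  osd_util_t sample[] = { {0, 0.91}, {1, 0.42}, {2, 0.70}, {3, 0.55} };
  std::vector<osd_util_t> osds(sample, sample + 4);

  double sum = 0.0;
  for (size_t i = 0; i < osds.size(); ++i)
    sum += osds[i].util;
  double average = sum / osds.size();

  std::sort(osds.begin(), osds.end(), Sorter(average));

  // With a max_osds cap, only the most extreme outliers (high or low)
  // are reweighted in a single pass.
  const size_t max_osds = 2;
  for (size_t i = 0; i < osds.size() && i < max_osds; ++i)
    std::printf("adjust osd.%d (util %.2f, avg %.2f)\n",
                osds[i].id, osds[i].util, average);
  return 0;
}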
Kefu Chai [Thu, 12 May 2016 12:28:11 +0000 (20:28 +0800)]
osd: reset session->osdmap if session is not waiting for a map anymore
We should release the osdmap reference once we are done with it; otherwise we
might wait a very long time before that reference is updated with a newer
osdmap ref. This appears as an OSDMap leak: the map is held by a quiet
OSD::Session forever.
The osdmap is not reset in OSD::session_notify_pg_create(), because its
only caller is wake_pg_waiters(), which calls dispatch_session_waiting()
later; dispatch_session_waiting() checks session->osdmap and also resets
the osdmap if session->waiting_for_pg.empty().
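A minimal sketch of the idea, using a simplified Session stand-in rather than
the real OSD::Session; the point is only that the reference is dropped as soon
as nothing is waiting on it:

#include <list>
#include <map>
#include <memory>

// Simplified stand-ins for OSDMapRef and OSD::Session (illustrative only).
struct OSDMap { int epoch; };
typedef std::shared_ptr<const OSDMap> OSDMapRef;

struct Session {
  OSDMapRef osdmap;                              // last map used for dispatch
  std::map<int, std::list<int> > waiting_for_pg; // pgid -> queued requests
};

// Once the session has nothing left waiting on a map, release the reference
// so the (possibly large) OSDMap can be freed instead of being pinned by an
// idle session until the next time it happens to be dispatched.
void dispatch_session_waiting(Session &s, OSDMapRef curmap) {
  // ... dispatch any requests whose pg mapping is known under curmap ...
  if (s.waiting_for_pg.empty())
    s.osdmap.reset();   // do not hold the map ref any longer than needed
  else
    s.osdmap = curmap;  // still waiting: track the newest map we used
}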
Sage Weil [Thu, 10 Mar 2016 14:50:07 +0000 (09:50 -0500)]
log: do not repeat errors to stderr
If we get an error writing to the log, log it only once to stderr.
This avoids generating, say, 72 GB of ENOSPC errors in
teuthology.log when /var/log fills up.
Conflicts:
src/log/Log.cc (drop m_uid and m_gid which are not used in hammer;
order of do_stderr, do_syslog, do_fd conditional blocks is reversed in
hammer; drop irrelevant speed optimization code from 5bfe05aebfefdff9022f0eb990805758e0edb1dc)
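A sketch of the idea with hypothetical names (the real logic lives in
src/log/Log.cc): remember that a write error has already been reported and
stay quiet until the log file becomes writable again:

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <unistd.h>

static bool log_error_reported = false;  // hypothetical flag, one per logger

void write_log_entry(int fd, const char *buf, size_t len) {
  ssize_t r = ::write(fd, buf, len);
  if (r < 0) {
    if (!log_error_reported) {
      std::fprintf(stderr, "problem writing to log file: %s\n",
                   std::strerror(errno));
      log_error_reported = true;   // suppress repeats (e.g. endless ENOSPC)
    }
  } else {
    log_error_reported = false;    // writes work again; report future errors
  }
}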
mds: only open non-regular inode with mode FILE_MODE_PIN
ceph_atomic_open() in the kernel client does lookup and open at the same
time, so it can open a symlink inode with mode CEPH_FILE_MODE_WR. Opening a
symlink inode with mode CEPH_FILE_MODE_WR triggers an assertion in
Locker::check_inode_max_size().
Multi-delete is triggered by a query parameter on POST, but there are
multiple valid ways of representing it, and Ceph should accept any request
that has the query parameter set, regardless of its value or the absence of
a value.
This caused the RubyGem aws-sdk-v1 to break, and has been present since
multi-delete was first added in commit 0a1f4a97da, for the bobtail
release.
Fixes: http://tracker.ceph.com/issues/16618
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
(cherry picked from commit a7016e1b67e82641f0702fda4eae799e953063e6)
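A sketch of the intended check, using a hypothetical parsed-query map rather
than the actual RGW request plumbing; the multi-delete path should key off the
presence of the "delete" parameter alone:

#include <iostream>
#include <map>
#include <string>

// "?delete", "?delete=" and "?delete=1" must all trigger multi-delete.
bool is_multi_delete(const std::map<std::string, std::string> &query_params) {
  return query_params.count("delete") > 0;  // presence, not value, decides
}

int main() {
  std::map<std::string, std::string> q;
  q["delete"] = "";  // e.g. parsed from "POST /bucket?delete"
  std::cout << std::boolalpha << is_multi_delete(q) << std::endl;  // true
  return 0;
}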
shun-s [Tue, 28 Jun 2016 07:30:16 +0000 (15:30 +0800)]
ReplicatedBackend: delete one redundant op->mark_started, as there are two in ReplicatedBackend::sub_op_modify_impl
Delete one mark_started event, as there are two identical op->mark_started calls in ReplicatedBackend::sub_op_modify_impl.
Fixes: http://tracker.ceph.com/issues/16572
Signed-off-by: shun-s <song.shun3@zte.com.cn>
rgw: Set Access-Control-Allow-Origin to an asterisk if allowed in a rule
Before this patch the RGW would respond with the Origin sent by the client in the request
if a wildcard/asterisk was specified as a valid Origin.
This patch makes sure we respond with a header like this:
Access-Control-Allow-Origin: *
This way a resource can be used from different Origins by the same browser, and that browser
can reuse the content, since the asterisk matches any Origin.
We also keep in mind that when Authorization is sent by the client, different rules apply.
In the case of Authorization we may not respond with an asterisk, but we do have to
add the Vary header with 'Origin' as a value to let the browser know that for different
Origins it has to perform a new request.
More information: https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS
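The intended response logic, sketched with a plain header map instead of the
real RGW CORS rule engine (names here are illustrative):

#include <map>
#include <string>

// If a rule allows "*" and the request carries no Authorization header,
// answer with a literal asterisk so the response is origin-agnostic.
// With Authorization we must echo the specific Origin and add
// "Vary: Origin" so browsers re-request for other origins.
void set_cors_headers(const std::string &request_origin,
                      bool rule_allows_wildcard,
                      bool has_authorization,
                      std::map<std::string, std::string> &headers) {
  if (rule_allows_wildcard && !has_authorization) {
    headers["Access-Control-Allow-Origin"] = "*";
  } else {
    headers["Access-Control-Allow-Origin"] = request_origin;
    headers["Vary"] = "Origin";
  }
}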
Conflicts:
src/mon/Monitor.cc (the signature of Monitor::reply_command()
changed a little bit in master, so adapt the
commit to work with the old method)
Conflicts:
src/rgw/rgw_user.cc The "if (op_state.will_purge_keys())" block was
later changed to "always purge all associated keys" by e7b7e1afc7a81c3f97976f7442fbdc5118b532b5 - keep the hammer version
Jianpeng Ma [Tue, 14 Apr 2015 01:11:58 +0000 (09:11 +0800)]
osd: Fix endless ec pg repair when an unrecoverable object is encountered.
In repair_object(), if bad_peer is a replica, the soid is not added to
MissingLoc for an ec pool. If there are enough bad replicas in the ec pool
that the object cannot be recovered, the subsequent recovery will loop
endlessly.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit d51806f5b330d5f112281fbb95ea6addf994324e)
Yehuda Sadeh [Thu, 26 Mar 2015 00:35:40 +0000 (17:35 -0700)]
rgw: identify racing writes when using copy-if-newer
When copying an object from a different zone with copy-if-newer specified,
if the final meta write is canceled, check whether the destination that was
created is actually newer than our mtime; otherwise retry.
Yehuda Sadeh [Thu, 5 May 2016 21:02:25 +0000 (14:02 -0700)]
rgw: handle stripe transition when flushing final pending_data_bl
Fixes: http://tracker.ceph.com/issues/15745
When complete_writing_data() is called, if pending_data_bl is not empty
we still need to handle the stripe transition correctly. If pending_data_bl
has more data than the current stripe can allow, move to the next one.
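A self-contained sketch of that flush loop, with simplified names
(stripe_size, ofs_in_stripe and a string in place of pending_data_bl and the
RADOS writes):

#include <cstddef>
#include <string>

struct StripeWriter {
  size_t stripe_size;     // bytes allowed per stripe
  size_t ofs_in_stripe;   // bytes already written into the current stripe
  std::string written;    // stand-in for the actual RADOS object writes

  explicit StripeWriter(size_t s) : stripe_size(s), ofs_in_stripe(0) {}

  // Flush everything that is still pending, transitioning to a new stripe
  // whenever the current one has no room left.
  void flush_pending(std::string pending) {
    while (!pending.empty()) {
      size_t room = stripe_size - ofs_in_stripe;
      if (room == 0) {          // stripe is full: move to the next stripe
        ofs_in_stripe = 0;
        continue;
      }
      size_t n = pending.size() < room ? pending.size() : room;
      written.append(pending, 0, n);
      ofs_in_stripe += n;
      pending.erase(0, n);
    }
  }
};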
Sage Weil [Fri, 6 May 2016 13:09:43 +0000 (09:09 -0400)]
osdc/Objecter: upper bound watch_check result
This way we always return a safe upper bound on the amount of time
since we did a check. Among other things, this prevents us from
returning a value of 0, which is confusing.
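A tiny sketch of the rounding rule being described (hypothetical helper, not
the Objecter API): round the elapsed time up so the caller never sees zero.

#include <cmath>

int watch_check_age(double seconds_since_last_check) {
  int age = static_cast<int>(std::ceil(seconds_since_last_check));
  return age > 0 ? age : 1;   // always a positive upper bound, never 0
}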
Yehuda Sadeh [Mon, 16 May 2016 21:35:12 +0000 (14:35 -0700)]
rgw: keep track of written_objs correctly
Fixes: http://tracker.ceph.com/issues/15886
Only add a rados object to the written_objs list if the write
was successful. Otherwise, if the write is canceled for some reason, we would
remove an object that we did not actually write. This was a problem when
multiple writes went to the same part: the second writer should fail its
write, since we do an exclusive write, but we added the object's name to the
written_objs list anyway, which became a real problem when the old processor
was disposed (as it cleared those objects).
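A sketch of the corrected bookkeeping with hypothetical names (written_objs
is real, the surrounding structure is simplified): the object is tracked only
after the exclusive write succeeds.

#include <set>
#include <string>

struct PartWriter {
  std::set<std::string> written_objs;  // objects we own and may clean up

  // Stand-in for an exclusive-create RADOS write; returns 0 on success or a
  // negative error (e.g. -EEXIST) if another writer already created the part.
  int do_exclusive_write(const std::string &) { return 0; }

  int write_part(const std::string &oid) {
    int r = do_exclusive_write(oid);
    if (r < 0)
      return r;                 // canceled: do NOT track this object
    written_objs.insert(oid);   // only successful writes are ours to dispose
    return 0;
  }
};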
Kefu Chai [Mon, 9 May 2016 07:01:46 +0000 (15:01 +0800)]
osd: remove all stale osdmaps in handle_osd_map()
In a large cluster, there is a greater chance that the OSD fails to trim
the cached osdmaps in a timely manner, and sometimes it simply cannot keep
up with the incoming osdmaps if skip_maps is set, so the osdmap cache can
keep building up to over 250 GB in size. In this change:
* publish_superblock() before trimming the osdmaps, so other osdmap
consumers of OSDService.superblock won't access the osdmaps being
removed.
* trim all stale osdmaps in batches of conf->osd_target_transaction_size
if skip_maps is true (a batching sketch follows this list). In my test,
this happens when the osd only receives the osdmap from the monitor
occasionally, because the osd happens to be chosen when the monitor
wants to share a new osdmap with a random osd.
* always use dedicated transaction(s) for trimming osdmaps, so even in
the normal case where we are able to trim all stale osdmaps in a
single batch, a separate transaction is used. We could piggyback on
the commits for removing maps, but we keep it this way for simplicity.
* use std::min() instead of MIN() for type safety
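A rough sketch of the batching, with a toy Transaction type standing in for
ObjectStore::Transaction and hypothetical epoch bounds:

#include <algorithm>
#include <cstdio>

struct Transaction {              // toy stand-in for ObjectStore::Transaction
  int deletes;
  Transaction() : deletes(0) {}
  void remove_map(unsigned e) { ++deletes; (void)e; }
};

void submit(Transaction &t) {     // stand-in for queueing/committing the txn
  std::printf("committing transaction with %d map removals\n", t.deletes);
}

// Remove all cached maps in [oldest, new_oldest), one dedicated transaction
// per batch of at most target_transaction_size removals.
void trim_stale_maps(unsigned oldest, unsigned new_oldest,
                     unsigned target_transaction_size) {
  unsigned e = oldest;
  while (e < new_oldest) {
    Transaction t;
    // std::min() instead of a MIN() macro keeps the comparison type-safe.
    unsigned batch_end = std::min(e + target_transaction_size, new_oldest);
    for (; e < batch_end; ++e)
      t.remove_map(e);
    submit(t);
  }
}

int main() {
  trim_stale_maps(100, 1000, 300);  // three batches of 300 removals each
  return 0;
}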
Kefu Chai [Wed, 16 Mar 2016 13:15:35 +0000 (21:15 +0800)]
osd: populate the trim_thru epoch using MOSDMap.oldest_map
Instead of filling MOSDMap with the local oldest_map, we share the maximum
MOSDMap.oldest_map received so far with peers. That way, one OSD's failure
to trim won't prevent it from telling others that they are allowed to trim.
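Sketched with plain variables in place of the real message fields
(illustrative only):

#include <algorithm>

struct MOSDMapMsg {           // simplified stand-in for MOSDMap
  unsigned oldest_map;
  unsigned newest_map;
};

unsigned max_oldest_seen = 0; // highest oldest_map observed in incoming maps

void note_incoming(const MOSDMapMsg &m) {
  max_oldest_seen = std::max(max_oldest_seen, m.oldest_map);
}

// When sharing maps with a peer, advertise the highest oldest_map we have
// seen rather than our own (possibly lagging) local oldest_map, so our slow
// trimming does not hold the peer's trim_thru epoch back.
void fill_outgoing(MOSDMapMsg &m, unsigned local_newest) {
  m.oldest_map = max_oldest_seen;
  m.newest_map = local_newest;
}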
Brad Hubbard [Fri, 6 May 2016 05:05:42 +0000 (15:05 +1000)]
common: Add space between timestamp and "min lat:" in bench output
This change is taken from 069d95eaf49cadaa9a8fa1fa186455944a50ec7d
but I did not want to cherry-pick that patch since the rest of it
is purely cosmetic and would be unlikely to apply cleanly.
Adam Kupczyk [Wed, 2 Mar 2016 11:31:01 +0000 (12:31 +0100)]
mon: Fixed calculation of %USED. It now shows (space used by all replicas)/(raw space available on OSDs); before, it was (size of pool)/(raw space available on OSDs).
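A toy calculation showing the difference between the two formulas (made-up
numbers, replica count of 3):

#include <cstdio>

int main() {
  double bytes_stored = 100e9;   // logical data stored in the pool
  double replicas     = 3.0;     // pool size (number of replicas)
  double raw_avail    = 900e9;   // raw space available on the OSDs

  double old_pct = bytes_stored / raw_avail * 100.0;             // ~11.1%
  double new_pct = bytes_stored * replicas / raw_avail * 100.0;  // ~33.3%

  std::printf("before: %.1f%%  after: %.1f%%\n", old_pct, new_pct);
  return 0;
}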
Samuel Just [Thu, 10 Mar 2016 23:19:15 +0000 (15:19 -0800)]
LFNIndex::lfn_translate: consider alt attr as well
If the file has an alt attr, there are two possible matching
ghobjects. We want to make sure we choose the right one for
the short name we have. If we don't, a split while there are
two objects linking to the same inode will result in one of
the links being orphaned in the source directory, resulting
in #14766.
Vitja Makarov [Wed, 17 Feb 2016 10:46:18 +0000 (13:46 +0300)]
hammer: rgw: S3: set EncodingType in ListBucketResult
Signed-off-by: Victor Makarov <vitja.makarov@gmail.com>
(cherry picked from commit d2e281d2beb0a49aae0fd939f9387cb2af2692c8)
X-Github-PR: 7712
Backport: hammer
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
Jianpeng Ma [Sun, 22 Mar 2015 14:07:24 +0000 (22:07 +0800)]
osd/Replicated: For CEPH_OSD_OP_WRITE, set data digest.
Add two cases in which a data digest can be set for OP_WRITE (sketched below):
a: offset = 0, and length > original size
b: offset = original size, and the original has a data_digest.
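A sketch of the two conditions as a standalone predicate (hypothetical helper;
the real check lives in the replicated-PG write path):

#include <cstdint>

bool can_set_data_digest(uint64_t offset, uint64_t length,
                         uint64_t original_size, bool has_original_digest) {
  // a: write starts at 0 and extends past the original size, so it fully
  //    overwrites the object and the new digest is the digest of the data.
  if (offset == 0 && length > original_size)
    return true;
  // b: write appends exactly at the end of an object whose digest is known,
  //    so the digest can be extended over the appended data.
  if (offset == original_size && has_original_digest)
    return true;
  return false;
}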