Igor Fedotov [Mon, 3 Feb 2020 15:50:50 +0000 (18:50 +0300)]
os/bluestore: do not use 'unused' bitmap if makes no sense.
The processing logic that relies on the 'unused' bitmap only makes sense for
bluestore setups where min alloc size differs from the device block
size. Skip it when that is not the case.
Igor Fedotov [Mon, 3 Feb 2020 15:36:21 +0000 (18:36 +0300)]
os/bluestore: fix unused 'tail' calculation.
Fixes: https://tracker.ceph.com/issues/41901
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit c91cc3a8d689995e8554c41c9b0f652d9a3458da)
Conflicts:
src/test/objectstore/store_test.cc
- omitted test case "TEST_P(StoreTestSpecificAUSize, ReproBug41901Test)"
from the backport, because nautilus does not have the
"bluestore_debug_enforce_settings" option
Matthew Oliver [Wed, 26 Feb 2020 06:15:22 +0000 (06:15 +0000)]
rgw: anonymous swift to obj that don't exist should 401
Currently, if you attempt to GET an object in the Swift API that
doesn't exist and you don't pass an `X-Auth-Token`, it will 404 instead of
401.
This is actually a rather big problem, as it means someone can leak data
out of the cluster: not the object data itself, but whether an object exists or
not.
This is caused by the SwiftAnonymousEngine's, frankly, wide-open
is_applicable acceptance. When we get to checking the bucket or object
for user acceptance we deal with it properly, but if the object doesn't
exist, because the user has been "authorised", rgw returns a 404.
Why? Because we always override the user with the Swift account.
Meaning as far as checks are concerned the auth user is the user, not
an anonymous user.
I assume this is because a swift container could have world-readable
reads or writes, and slight divergences between the s3 and swift apis can let these
interesting edge cases leak in.
This patch doesn't change the user to the swift account if they are
anonymous. So we can do some anonymous checks when it suits later in the
request processing path.
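A toy model of the behaviour described above, not rgw's actual code. The names `Request`, `effective_user`, `get_status` and the `ANONYMOUS` marker are hypothetical, and token validation is reduced to "any token is valid":

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

ANONYMOUS = "anonymous"  # hypothetical marker for an unauthenticated caller


@dataclass
class Request:
    account_name: str                 # Swift account named in the URL
    auth_token: Optional[str] = None  # value of X-Auth-Token, if any


def effective_user(req: Request) -> str:
    """Identity that later permission checks should see."""
    if req.auth_token is None:
        # Patched behaviour: do not promote an anonymous caller to the account.
        return ANONYMOUS
    # Toy assumption: any token is valid and maps to the account owner.
    return req.account_name


def get_status(req: Request, container: str, name: str,
               objects: Set[Tuple[str, str]], public: Set[str]) -> int:
    """HTTP status for a GET in this toy model."""
    if effective_user(req) == ANONYMOUS and container not in public:
        return 401  # decided before the existence check, so nothing leaks
    if (container, name) not in objects:
        return 404
    return 200
```

In this model an anonymous GET on a private container yields 401 whether or not the object exists, while a world-readable container can still answer 404 for a missing object.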
Fixes: https://tracker.ceph.com/issues/43617
Signed-off-by: Matthew Oliver <moliver@suse.com>
(cherry picked from commit b03d9754e113d24221f1ce0bac17556ab0017a8a)
Conflicts:
src/rgw/rgw_swift_auth.h
- where master has "rgw_user(s->account_name)", nautilus has
"s->account_name" only
Laura Paduano [Wed, 13 May 2020 12:16:57 +0000 (14:16 +0200)]
Merge pull request #34450 from rhcs-dashboard/wip-44980-nautilus
nautilus: monitoring: Fix pool capacity incorrect
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Patrick Seidensal <pseidensal@suse.com>
Reviewed-by: Laura Paduano <lpaduano@suse.com>
Sage Weil [Thu, 27 Feb 2020 15:30:27 +0000 (09:30 -0600)]
compressor/lz4: rebuild if buffer is not contiguous
In older versions of lz4 (specifically < 1.8.2) bit errors
can be introduced when compressing from fragmented memory. The lz4
bug was fixed by this lz4 commit:
The error can be reproduced using following command :
./frametest -v -i100000000 -s1659 -t31096808
It's actually a bug in the stream LZ4 API,
when starting a new stream
and providing a first chunk to complete with size < MINMATCH.
In which case, the chunk becomes a dictionary.
No hash was generated and stored,
but the chunk is accessible as default position 0 points to dictStart,
and position 0 is still within MAX_DISTANCE.
Then, next attempt to read 32-bits from position 0 fails.
The issue would have been mitigated by starting from index 64 KB,
effectively eliminating position 0 as too far away.
The proper fix is to eliminate such "dictionary" as too small.
Which is what this patch does.
This is a workaround to rebuild our input buffer into a contiguous buffer
if it is not already contiguous.
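The idea of the workaround can be sketched as follows. This is only an illustration: a fragmented buffer is modelled as a list of `bytes` objects, and `rebuild_if_fragmented` is a hypothetical helper, not the compressor code itself:

```python
from typing import List


def rebuild_if_fragmented(fragments: List[bytes]) -> List[bytes]:
    """Return a buffer made of at most one fragment.

    A buffer that is already contiguous (zero or one fragment) is returned
    unchanged, so the copy is only paid for when it is actually needed.
    """
    if len(fragments) <= 1:
        return fragments
    # Copy all fragments into one contiguous block before compressing.
    return [b"".join(fragments)]
```

The compressor would then always see a single contiguous input, which avoids the streaming path that triggers the bug in older lz4 versions.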
Dan van der Ster [Wed, 26 Feb 2020 20:50:07 +0000 (21:50 +0100)]
test/compressor: test round trip of an osdmap
Check if the compressors can compress/decompress a bufferlist which is not word
aligned, such as a freshly-encoded osdmap.
Related-to: https://tracker.ceph.com/issues/39525
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 1b1c71a2c28c38d3e28f006b1cb164435a653c02)
Conflicts:
qa/suites/rbd/openstack/workloads/devstack-tempest-gate.yaml
- some difference compared to master, but the entire test is being deleted so
I didn't examine it further
Or Friedmann [Wed, 4 Sep 2019 13:34:52 +0000 (16:34 +0300)]
fix rgw lc does not delete objects that do not have exactly the same tags as the rule
An object may carry more tags than the rule that applies to it.
Previously the object was not deleted unless its tags were exactly the same as those in the rule.
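A minimal sketch of the intended matching semantics, assuming tags are plain string key/value pairs. `rule_tags_match` is a hypothetical name, not a function from the rgw lifecycle code:

```python
from typing import Dict


def rule_tags_match(rule_tags: Dict[str, str], object_tags: Dict[str, str]) -> bool:
    """True when every tag required by the rule is present on the object.

    Extra tags on the object are ignored; only the rule's tags must match.
    """
    return all(
        key in object_tags and object_tags[key] == value
        for key, value in rule_tags.items()
    )
```

With this check, an object tagged `{"env": "test", "owner": "qa"}` matches a rule that only requires `{"env": "test"}`, whereas an exact-equality comparison would skip it.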
S3-tests: ceph/s3-tests#303
Fixes: https://tracker.ceph.com/issues/41652
Signed-off-by: Or Friedmann <ofriedma@redhat.com>
(cherry picked from commit ebb806ba83fa9d68f14194b1f9886f21f7195a3d)
mon/OSDMonitor: ensure lec only accounts for up osds
If we also consider down osds, we may very well be in a healthy state
but keeping maps as far back as the last epoch when a given osd went
down. If said osd stays down for eons, we will be keeping bajillions of
maps that we shouldn't.
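A small sketch of the effect, assuming "lec" means the lowest last-epoch-clean value reported across OSDs and that it bounds how far back maps are kept. The function name and data layout are hypothetical:

```python
from typing import Dict, Optional, Set


def min_last_epoch_clean(last_epoch_clean: Dict[int, int],
                         up_osds: Set[int]) -> Optional[int]:
    """Lowest last-epoch-clean among OSDs that are currently up.

    Down OSDs are ignored so that a long-dead OSD cannot pin old maps.
    Returns None when no up OSD has reported a value.
    """
    epochs = [epoch for osd, epoch in last_epoch_clean.items() if osd in up_osds]
    return min(epochs) if epochs else None
```

For example, with reports `{0: 100, 1: 5, 2: 90}` and osd.1 down, the result is 90 rather than 5.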
J. Eric Ivancich [Fri, 10 Jan 2020 19:12:35 +0000 (14:12 -0500)]
rgw: clean up address 0-length listing results...
Some minor clean-ups to the previous commit, including adjust logging
messages, rename variable, convert a #define to a constexpr (and
adjust its scope).
J. Eric Ivancich [Thu, 13 Feb 2020 01:38:44 +0000 (20:38 -0500)]
rgw: address 0-length listing results when non-vis entries dominate
A change to advance the marker in RGWRados::cls_bucket_list_ordered to
the last entry visited rather than the final entry in list to push
progress as far as possible.
Since non-vis entries tend to cluster on the same shard, such as
during incomplete multipart uploads, this can severely limit the
number of entries returned by a call to
RGWRados::cls_bucket_list_ordered since once that shard has provided
all its members, we must stop. This interacts with a recent
optimization to reduce the number of entries requested from each
shard. To address this the number of attempts is sent as a parameter,
so the number of entries requested from each shard can grow with each
attempt. Currently the growth is linear but perhaps exponential growth
(capped at number of entries requested) should be considered.
Previously RGWRados::Bucket::List::list_objects_ordered was capped at
2 attempts, but now we keep attempting to ensure we make forward
progress and return entries when some exist. If we fail to make
forward progress, we log the error condition and stop looping.
Additional logging, mostly at level 20, is added to the two key
functions involved in ordered bucket listing to make it easier to
follow the logic and address potential future issues that might arise.
Additionally modify attempt number based on how many results were
received.
Change the per-shard request number, so it grows exponentially rather
than linearly as the attempts go up.
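The difference between the two growth schemes can be sketched like this. The doubling factor and the function names are assumptions made for illustration; the message only says the growth is exponential and capped:

```python
def linear_request(base: int, attempt: int, cap: int) -> int:
    """Per-shard request size that grows linearly with the attempt number."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(cap, base * attempt)


def exponential_request(base: int, attempt: int, cap: int) -> int:
    """Per-shard request size that doubles with each attempt (assumed factor 2)."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(cap, base * 2 ** (attempt - 1))
```

Starting from 8 entries with a cap of 1000, the fourth attempt asks each shard for 32 entries under linear growth and 64 under doubling, so the exponential scheme reaches the cap in far fewer round trips.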
J. Eric Ivancich [Mon, 14 Oct 2019 20:21:35 +0000 (16:21 -0400)]
rgw: reduce per-shard entry count during ordered bucket listing
Currently, if a client requests the 1000 next entries from a bucket,
each bucket index shard will receive a request for the 1000 next
entries. When there are hundreds, thousands, or tens of thousands of
bucket index shards, this results in a huge amplification of the
request, even though only 1000 entries will be returned.
These changes reduce the per-bucket index shard requests. These also
allow re-requests in edge cases where all of one shard's returned
entries are consumed. Finally these changes improve the determination
of whether the resulting list is truncated.
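The amplification is easy to quantify. The sketch below compares the old behaviour with a simple even split plus a small floor; that split is an illustrative stand-in chosen here, not the formula the patch uses, and `min_read` is a hypothetical parameter:

```python
import math


def total_requested(per_shard: int, num_shards: int) -> int:
    """Total number of entries requested across all bucket index shards."""
    return per_shard * num_shards


def even_split_per_shard(num_entries: int, num_shards: int, min_read: int = 8) -> int:
    """Illustrative per-shard request: an even share, but never below min_read."""
    if num_shards < 1:
        raise ValueError("need at least one shard")
    return max(min_read, math.ceil(num_entries / num_shards))
```

Asking 1000 shards for 1000 entries each requests 1,000,000 entries to return 1000; with the even split above the same listing requests 8,000. An even split can under-fetch when entries are unevenly spread, which is why re-requests are needed in the edge cases mentioned above.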
Kefu Chai [Wed, 6 May 2020 07:48:12 +0000 (15:48 +0800)]
qa/suites/upgrade: disable min pg per osd warning
disable the TOO_FEW_PGS warning. as 1ac34a5ea3d1aca299b02e574b295dd4bf6167f4 is not backported to mimic, we
will have TOO_FEW_PGS warnings when a healthy cluster is expected when
upgrading from mimic.
this change disables this warning by setting "mon_pg_warn_min_per_osd" to
"0".
this change is not cherry-picked from master, as 1ac34a5ea3d1aca299b02e574b295dd4bf6167f4 is already included by master,
and we don't perform upgrade from mimic on master branch.
mon/OSDMonitor: Always tune priority cache manager memory on all mons
Always call into priority cache manager (pcm) to tune the memory on the
leader and on all the followers (peons) as part of each tick(). This
ensures that the pcm on all the mons continuously tunes the tcmalloc
memory thereby ensuring that the peons don't run out of memory eventually.
Vikhyat Umrao [Fri, 30 Aug 2019 07:16:46 +0000 (00:16 -0700)]
radosgw-admin: add support for --bucket-id in bucket stats command
Fixes: https://tracker.ceph.com/issues/41061
Signed-off-by: Vikhyat Umrao <vikhyat@redhat.com>
(cherry picked from commit 4cd16e13ca0c8709091737ad2cb2b37a3b19840d)
Conflicts:
src/rgw/rgw_admin.cc
nautilus uses opt_cmd == OPT_BUCKET_STATS
nautilus does not have store->ctl()->meta.mgr
use store->meta_mgr
src/rgw/rgw_bucket.cc
nautilus has different declaration for RGWBucket::link
nautilus cannot take nullptr in rgw_bucket_parse_bucket_key
use &shard_id
src/rgw/rgw_bucket.h
nautilus does not have set_tenant(); add set_tenant()
nautilus does not have get_tenant(); add get_tenant()
Kefu Chai [Sat, 2 May 2020 15:54:44 +0000 (23:54 +0800)]
qa/tasks/ceph.py: do not use option mimic does not understand
mimic does not have `mon-client-directed-command-retry` option, so we
should not pass this option to a mimic ceph client.
in this change, we fall back to plain retry if the command fails. this
change is not cherry-picked from master, as we don't run upgrade test
from mimic to master.
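A "plain retry" can be sketched as a small client-side loop. This is a generic illustration rather than the code in qa/tasks/ceph.py; `run_with_retry`, its parameters, and the use of `RuntimeError` as the failure signal are all assumptions:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(cmd: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call cmd until it succeeds or the attempts are used up.

    The last failure is re-raised when every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for remaining in range(attempts - 1, -1, -1):
        try:
            return cmd()
        except RuntimeError:
            if remaining == 0:
                raise  # out of attempts: surface the last failure
            time.sleep(delay)
    raise AssertionError("unreachable")
```

The retry lives entirely in the caller, so it works against a client that does not understand any retry-related option.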
Dan Hill [Wed, 15 Apr 2020 21:54:09 +0000 (14:54 -0700)]
rados: prevent ShardedOpWQ suicide_grace drop when waiting for work.
The Sharded OpWQ will opportunistically wait for more work when
processing an empty queue. While waiting, the default work queue
heartbeat timeout and suicide_grace values are modified. The
`threadpool_default_timeout` grace is applied and suicide_grace is
disabled. If this op hangs, the heartbeat watchdog will not trigger an
OSD suicide recovery.
The default work queue values for grace and suicide_grace are re-applied
after finding work. This keeps the heartbeat timeouts consistent with
the values applied on _process() entry.
Fixes: https://tracker.ceph.com/issues/45076
Signed-off-by: Dan Hill <daniel.hill@canonical.com>
(cherry picked from commit 85f6e8d29cd8d0d30b3f07b26974357d875b6908)
Kefu Chai [Fri, 1 May 2020 06:06:56 +0000 (14:06 +0800)]
qa/suites/rados: use default objectsize for upgrade tests
in pre-nautilus releases, the rados cli does not accept the `-O` option, so we
should not pass `-O` to this tool. otherwise we will have the following
failure:
```
2020-05-01T05:47:04.863 INFO:tasks.radosbench.radosbench.2.smithi183.stderr:unrecognized command -O; -h or --help for usage
2020-05-01T05:47:04.865 DEBUG:teuthology.orchestra.run:got remote process result: 1
```
this change is not cherry-picked from master, as we don't perform
upgrade tests from pre-nautilus releases.
qa/tasks: do not cancel pending pg num changes on mimic
mimic does not support auto split/merge, but we do test mimic-x on
nautilus, which ends up with failures like:
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/contextutil.py", line 34, in nested
yield vars
File "/home/teuthworker/src/git.ceph.com_ceph_nautilus/qa/tasks/ceph.py", line 1928, in task
ctx.managers[config['cluster']].stop_pg_num_changes()
File "/home/teuthworker/src/git.ceph.com_ceph_nautilus/qa/tasks/ceph_manager.py", line 1806, in stop_pg_num_changes
if pool['pg_num'] != pool['pg_num_target']:
KeyError: 'pg_num_target'
so we need to skip this if 'pg_num_target' is not in pg_pool_t::dump().
this change is not cherry-picked from master, as we don't test
mimic-x on master.
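The guard can be sketched as below. This is a standalone illustration, not the body of `stop_pg_num_changes()`: the helper name is hypothetical, and the `pool_name` key is assumed to be present in each pool dump:

```python
from typing import Any, Dict, List


def pools_with_pending_pg_num_change(pools: List[Dict[str, Any]]) -> List[str]:
    """Names of pools whose pg_num has not yet reached pg_num_target.

    Pools dumped by a release without 'pg_num_target' are skipped instead of
    raising KeyError.
    """
    pending = []
    for pool in pools:
        if 'pg_num_target' not in pool:
            continue  # older release: no pending pg_num change to cancel
        if pool['pg_num'] != pool['pg_num_target']:
            pending.append(pool['pool_name'])
    return pending
```

A pool dict without the key simply drops out of the result, which is the behaviour needed when the cluster is still running mimic daemons.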
Also, mon_pg_warn_min_per_osd is disabled by default now (or set to a
low value in vstart/testing) so there's no need to base the pg count on
this value.
Ideally someday we can remove this so that the default cluster value is
used but we need to keep this for deployments of older versions of Ceph.
Fixes: https://tracker.ceph.com/issues/42228
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit fc88e6c6c55402120a432ea47f05f321ba4c9bb1)
Conflicts:
qa/tasks/cephfs/filesystem.py: this commit was originally
backported by #34055, but it failed to cherry-pick all necessary
bits. in this change, the missing bit is picked up.
liupengs [Sun, 17 Nov 2019 15:03:07 +0000 (23:03 +0800)]
msg/async/rdma: fix bug event center is blocked by rdma construct connection for transport ib sync msg
We construct a tcp connection to transport ib sync msg. If the
remote node is shut down (e.g. by accident), net.connect will block until the timeout
is reached, which causes the event center to be blocked.
This bug may cause mon probe timeouts, osds not replying, and so on.