Zac Dover [Tue, 29 Jun 2021 11:45:22 +0000 (21:45 +1000)]
doc/cephadm: improving "Starting the Upgrade"
This PR (slightly) improves the text in the section "Starting
the Upgrade" in the "Upgrading Ceph" chapter of the cephadm
documentation.
This is a very minor update, and does little but bring the sentences
into agreement with many other sentences that I've already written.
This is done to give the reader an almost tabular sense of what to
expect when looking at our docs.
胡玮文 [Sun, 13 Jun 2021 06:23:56 +0000 (14:23 +0800)]
cephadm: workaround unit replace failure
This should be a bug in systemd. It failed to cleanup cgroups when stop the
unit. Then if we start a new unit with the same name, the 'ExecStartPre' command
will fail with status=219/CGROUP (Only when systemd unified cgroup hierarchy is
enabled), because cgroup v2 does not allow process in non-leaf group. This
should be fixed in systemd commit e08dabfec7304dfa0d59997dc4219ffaf22af717.
By now, we just remove these left over cgroups before start new unit.
Michael Fritch [Wed, 16 Jun 2021 20:20:38 +0000 (14:20 -0600)]
cephadm: fix regexp to strip `v1:` or `v2:` prefix from IPv6 addr
regexp was striping the first hextet of the IPv6 address:
```
FAILED tests/test_cephadm.py::TestBootstrap::test_mon_addrv[[0000:0000:0000:0000:0000:FFFF:C0A8:0101:1234]-list_networks5-None] - cephadm.Error: Cannot infer CIDR network for mon IP `0000:0000:0000:0000:FFFF:C0A8:0101`: pass --skip-mon-network to configure it later
```
(1) improves syntax and formatting of "Logging to stdout"
(2) improves syntax and formatting of "Logging to files"
(3) replaces all carets with tildes in 3rd-level section
headers in operations.rst (./build-doc was crying
about inconsistency when I fed it tildes, but tildes
and not carets are the RST standard according to
https://docutils.sourceforge.io/ \
docs/user/rst/quickstart.html#sections
so the carets had to go.
Sebastian Wagner [Tue, 15 Jun 2021 09:24:34 +0000 (11:24 +0200)]
pyhton-common: fix mypy errors
Fixes:
```
py3 run-test: commands[2] | mypy --config-file=../mypy.ini -p ceph
ceph/deployment/service_spec.py: note: In member "yaml_representer" of class "ServiceSpec":
ceph/deployment/service_spec.py:659: error: Argument 1 to "represent_dict" of "SafeRepresenter" has incompatible type "_OrderedDictItemsView[str, Any]"; expected "Mapping[Any, Any]"
```
Sebastian Wagner [Tue, 15 Jun 2021 08:19:40 +0000 (10:19 +0200)]
mgr/orch: fix mypy errors
Fixes:
```
orchestrator/__init__.py:6: note: In module imported here:
orchestrator/_interface.py: note: In member "yaml_representer" of class "DaemonDescription":
orchestrator/_interface.py:1039: error: Argument 1 to "represent_dict" of "SafeRepresenter" has incompatible type "ItemsView[Any, Any]"; expected "Mapping[Any, Any]"
orchestrator/_interface.py: note: In member "yaml_representer" of class "ServiceDescription":
orchestrator/_interface.py:1178: error: Argument 1 to "represent_dict" of "SafeRepresenter" has incompatible type "ItemsView[Any, Any]"; expected "Mapping[Any, Any]"
orchestrator/_interface.py: note: At top level:
orchestrator/_interface.py:1181: error: Argument 2 to "add_representer" has incompatible type "Callable[[SafeDumper, DaemonDescription], Any]"; expected "Callable[[SafeDumper, ServiceDescription], Node]"
Found 3 errors in 1 file (checked 29 source files)
```
limit the number of concurrent RGWMetaSyncSingleEntryCRs that each mdlog
shard is allowed to spawn. use META_SYNC_SPAWN_WINDOW=20 to match data-
and bucket sync
rgw: metadata sync treats all errors as 'transient'
collect_children() had a special case for EAGAIN that it treated as
a 'transient' error, which set the can_adjust_marker = false to bail out
of RGWMetaSyncShardCR and retry from the previous marker
but the http client doesn't return EAGAIN - rgw_http_error_to_errno()
defaults to EIO - so this retry logic based on can_adjust_marker never
runs. on any other error, RGWMetaSyncSingleEntryCR would not call
marker_tracker->finish() to advance the sync status marker, and
RGWMetaSyncShardCR would continue on with full- or incremental sync
without ever attempting to retry the failed entries
a detailed comment in collect_children() describes a different strategy
for handling 'permanent' errors, but that was never fully elaborated.
i also don't think there's a reasonable way to differentiate between
transient and permanent errors, so this treats all errors as transient
to be retried
if an error really is permanent for a given metadata key, metadata sync
will get stuck there and require manual intervention
luo rixin [Tue, 29 Dec 2020 06:39:21 +0000 (14:39 +0800)]
rgw/rgw_file: Fix the return value of read() and readlink()
Fixes: https://tracker.ceph.com/issues/49189 Signed-off-by: Dai zhiwei <daizhiwei3@huawei.com> Signed-off-by: luo rixin <luorixin@huawei.com>
(cherry picked from commit bfd83e8fa142873a0bdf09a4d1ad1b04127f5885)
test/rgw: fix use of poll() with timers in unittest_rgw_dmclock_scheduler
the AsyncScheduler uses an asio timer to dispatch work to its executor
with an optional delay. when no delay is requested, it waits on the
timer with an expiration time in the past (crimson::dmclock::TimeZero)
tests are failing here because poll() is returning without executing the
handlers of those expired timers
asio implements these timers with timerfd and epoll. debugging with
strace, i see that these timers armed with timerfd_settime() are not
always immediately ready according to epoll_wait():
rgw: read_obj_policy() consults iam_user_policies on ENOENT
when the head object doesn't exist, read_obj_policy() has to decide
whether to return ENOENT or EACCES
when there's a bucket policy, we check whether it has s3ListBucket
permissions. when there's an assumed role, we also need to check
against the role's policies in s->iam_user_policies
J. Eric Ivancich [Tue, 15 Jun 2021 19:20:33 +0000 (15:20 -0400)]
rgw: when deleted obj removed in versioned bucket, extra del-marker added
After initial checks are complete, this will read the OLH earlier than
previously to check the delete-marker flag and under the bug's
conditions will return -ENOENT rather than create a spurious delete
marker.
J. Eric Ivancich [Wed, 14 Apr 2021 17:55:22 +0000 (13:55 -0400)]
rgw: during reshard lock contention, adjust logging
When RGW fails to get a lock on a reshard log, we log it in such a way
that it looks like an error. Instead we'll make sure that the log
message is informational.
Adam C. Emerson [Fri, 16 Jul 2021 15:20:39 +0000 (11:20 -0400)]
rgw: radosgw-admin errors if marker not specified on data/mdlog trim
Check that a marker was specified and trim if we don't have one.
Also: In a world where we're parsing for generation, it doesn't really
make sense to have a 'no marker specified' as separate from a marker
that is just an empty string.
Also: Successful datalog trim returns zero, not
-ENODATA, and radosgw-admin should expect this.
Fixes: https://tracker.ceph.com/issues/51712 Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 4cb6fcf7f4e9548a8a0dd017b49a6ea23bedffd6)
Conflicts:
src/rgw/rgw_admin.cc
Cherry-pick notes:
- static_cast for RadosStore not needed in Pacific
J. Eric Ivancich [Fri, 30 Apr 2021 20:07:54 +0000 (16:07 -0400)]
rgw: fix bucket object listing when marker matches prefix
When an iniitial marker that ends with a delimiter is provided, it
prevents listing of that "subdirectory" due to new logic at the cls
level to make listing more efficient. The fix catches that situation.
The bucket list wrapper lost the automatic advancement of the marker for
each iteration. Some places manually set that marker, and it worked
fine, but some depended on it being automatically set. Fix it so that
the marker advances automatically each iteration.
Cherry-pick notes:
- differences due to renaming of rgw::sal::RGWObject to rgw::sal::Object
- differences due to use of RadosNotification in master and reservation_t in Pacific
Since inject_facts_as_vars is set to false in the ansible.cfg file then we
have to update the references to use ansible_facts[<thing>] instead of
ansible_<thing>.
We already install the dependency from ceph-ansible requirements.txt and to
avoid false positive (like after rebooting a node) we can retry failing test.
Without loading the ansible.cfg file from ceph-ansible project, we don't
have the pipelining enabled which can result in significant performance
improvement.
This removes the ANSIBLE_ACTION_PLUGINS, ANSIBLE_RETRY_FILES_ENABLED and
ANSIBLE_SSH_RETRIES environment variables as it is already included in the
ansible.cfg file.
ceph-volume/tests: update ansible ssh_args env var
The ansible ssh_args parameter is usually defined in the ansible.cfg file.
Currently this variable is overrided in tox to manage the vagrant ssh file
but we lost all default values.
rpm: drop use of $FIRST_ARG in ceph-immutable-object-cache
The use of $FIRST_ARG was probably required because the SUSE-specific
%service_* rpm macros were playing tricks on the shell positional parameters.
This is bad practice and error-prone, so let's assume that no macros should do
that anymore and hence it's safe to assume that positional parameters remain
unchanged after any rpm macro call.
Kefu Chai [Sat, 1 May 2021 15:30:18 +0000 (23:30 +0800)]
common/pick_address: define in_addr_t if it is not defined
neither mingw not not have in_addr_t defined, see
https://docs.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-in_addr,
so define it if it is not defined.