Sage Weil [Tue, 29 Oct 2019 14:28:18 +0000 (09:28 -0500)]
mgr/telemetry: send device telemetry via per-host POST to device endpoint
We do not want to associate devices with clusters because that may
communicate unnecessary information about the association between vendors
and clusters (which, when large, are potentially identifying).
Instead, do a POST per host with all of the devices on that host only.
The devices endpoint does not log the POST time, so these per-host
records won't be associated with each other.
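A minimal sketch of the per-host grouping described above. The record layout (`host`, `metrics`) and the function name are assumptions for illustration, not the telemetry module's actual schema, and the HTTP POST itself is left out:

```python
from collections import defaultdict


def build_per_host_reports(devices):
    """Group device records by host and return one payload per host.

    `devices` is a list of dicts with hypothetical keys "host" and
    "metrics"; the real module's data layout may differ.
    """
    by_host = defaultdict(list)
    for dev in devices:
        by_host[dev["host"]].append(dev["metrics"])
    # Each payload carries only one host's devices and no cluster
    # identifier, so separate payloads cannot be tied back together.
    return [{"devices": devs} for devs in by_host.values()]
```

Each returned payload would then be sent in its own POST to the device endpoint.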
Sage Weil [Mon, 28 Oct 2019 19:59:43 +0000 (14:59 -0500)]
Merge PR #31168 into master
* refs/pull/31168/head:
ceph-daemon: try py2 import before py3
qa/suites/rados/singleton-nomsgr/ceph-daemon: make sure python3 is installed
qa/standalone/test_ceph_damon.sh: test with python2 and python3
mgr/ssh: python, not python3
ceph-daemon: python, not python3
ceph-daemon: os.makedirs
ceph-daemon: configparser is ConfigParser on py2
ceph-daemon: avoid py3-isms
Reviewed-by: Sebastian Wagner <swagner@suse.com>
Reviewed-by: Alfredo Deza <adeza@redhat.com>
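For illustration, the "py2 import before py3" pattern named in the commit titles above usually looks like the sketch below. The section and option names are made up for the example:

```python
# Try the Python 2 module name first, then fall back to the Python 3 one.
try:
    from ConfigParser import ConfigParser  # Python 2
except ImportError:
    from configparser import ConfigParser  # Python 3

# The class behaves the same for basic use under either interpreter.
cp = ConfigParser()
cp.add_section("global")             # hypothetical section name
cp.set("global", "fsid", "example")  # hypothetical option and value
```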
This causes some bluestore test failures due to an ENOENT right after a
new object is created. The simplest reproducer is the
ObjectStore/StoreTest.FiemapEmpty/2 test.
Fixes: https://tracker.ceph.com/issues/42495
Signed-off-by: Sage Weil <sage@redhat.com>
*** WARNING: ./usr/src/debug/ceph-15.0.0-6512.g62bd825.el8.x86_64/src/common/obj_bencher.cc is executable but has empty or no shebang, removing executable bit
*** WARNING: ./usr/src/debug/ceph-15.0.0-6512.g62bd825.el8.x86_64/src/erasure-code/shec/determinant.c is executable but has empty or no shebang, removing executable bit
*** WARNING: ./usr/src/debug/ceph-15.0.0-6512.g62bd825.el8.x86_64/src/os/bluestore/AvlAllocator.cc is executable but has empty or no shebang, removing executable bit
*** WARNING: ./usr/src/debug/ceph-15.0.0-6512.g62bd825.el8.x86_64/src/os/bluestore/AvlAllocator.h is executable but has empty or no shebang, removing executable bit
Kefu Chai [Sun, 27 Oct 2019 03:03:20 +0000 (11:03 +0800)]
mgr/dashboard: accept exceptions from builtin SSL
see also https://github.com/cherrypy/cheroot/pull/4
so that we don't panic when a client tries to talk to us using an
unsupported protocol. The exception should be accepted, and the
client can then fall back to a supported protocol.
Kefu Chai [Sun, 27 Oct 2019 04:44:45 +0000 (12:44 +0800)]
ceph.spec.in: add missing python-yaml dependency for mgr-k8sevents
otherwise we might have:
```
ceph/src/pybind/mgr/k8sevents/__init__.py", line 1, in <module>
from .module import Module
File "/home/kchai/ceph/src/pybind/mgr/k8sevents/module.py", line 28, in <module>
import yaml
ImportError: No module named yaml
```
* timeout() is never passed any parameters when called, so let's
remove its "seconds" and "error_message" parameters
* use `getattr()` instead of `hasattr()` for retrieving the
member variable of `self`
* pass `self` to wrapper function explicitly.
* return `func()` right away.
* hardwire the error message of `TimeoutError` to "Timer expired",
because
- neither errno.ETIME nor errno.ETIMEOUT is portable
- the only caller of `TimeoutError` is `timeout()`, so there is
no need to have the flexibility to pass a different error message
* use `wraps()` as a decorator, simpler this way.
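A rough sketch of a decorator with the shape the list above describes. It assumes the timer is driven by SIGALRM (Unix only, main thread only) and that the duration lives in a hypothetical `timeout_seconds` attribute; neither detail is confirmed by the commit message:

```python
import signal
from functools import wraps


class TimeoutError(Exception):
    """Raised when the timer fires; the message is hardwired."""

    def __init__(self):
        super(TimeoutError, self).__init__("Timer expired")


def timeout(func):
    @wraps(func)  # keeps func's name and docstring on the wrapper
    def wrapper(self):
        # getattr() fetches the value and supplies a default in one step;
        # "timeout_seconds" is a hypothetical attribute name.
        seconds = getattr(self, "timeout_seconds", 0)
        if not seconds:
            return func(self)

        def handler(signum, frame):
            raise TimeoutError()

        old_handler = signal.signal(signal.SIGALRM, handler)
        signal.alarm(seconds)
        try:
            return func(self)
        finally:
            signal.alarm(0)  # cancel any pending alarm
            signal.signal(signal.SIGALRM, old_handler)

    return wrapper
```

The decorator is applied bare (`@timeout`) to a method, which matches the point that `timeout()` takes no parameters of its own.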
Ilya Dryomov [Thu, 24 Oct 2019 15:35:23 +0000 (17:35 +0200)]
krbd: retry on an empty list from udev_enumerate_scan_devices()
systemd 219 doesn't have the issue that is worked around in the
previous commit, but has a different one: udev_enumerate_scan_devices()
always succeeds, but sometimes returns an empty list when the device is
actually there. This happens rarely and at random so I haven't been
able to get to the bottom of it yet, but it looks like another similar
race condition in libudev.
Since an empty list is expected if the device isn't there, retry just
twice with a small sleep in-between. This appears to be enough: I got
7 occurrences per 600000 "rbd unmap" invocations, all of which needed
a single retry.
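The retry policy described above can be sketched as follows. The actual change is in the krbd C++ code; this Python version only illustrates the control flow, and `scan`, `retries` and `delay` are hypothetical names and values:

```python
import time


def scan_with_retry(scan, retries=2, delay=0.25):
    """Call scan(); on an empty result, retry up to `retries` more times.

    An empty list is a legitimate answer when the device is absent, so
    the number of retries is bounded and the last result is returned
    even if it is still empty.
    """
    for attempt in range(retries + 1):
        devices = scan()
        if devices or attempt == retries:
            return devices
        time.sleep(delay)  # small pause before scanning again
```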
Boris Ranto [Thu, 24 Oct 2019 14:54:05 +0000 (16:54 +0200)]
restful: Query nodes_by_id for items
The node dict that is passed to the _gather_leaf_ids function from the
_gather_osds function does not have 'items' in it. We also can't use
buckets at this point since those only exist for leaf nodes, not all
nodes.
We need to query the nodes_by_id dict to get 'items' for a node inside
the _gather_leaf_ids function instead.
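A simplified sketch of the lookup, assuming the usual CRUSH convention that bucket ids are negative and OSD ids are non-negative. The standalone function signature is an assumption; the module's real method may look different:

```python
def gather_leaf_ids(node_id, nodes_by_id):
    """Return the set of OSD ids found beneath the node `node_id`."""
    if node_id >= 0:
        # Non-negative ids are OSDs, i.e. leaves of the tree.
        return {node_id}
    leaves = set()
    # Take 'items' from the nodes_by_id entry rather than from the
    # node dict handed in by the caller, which may lack that key.
    for item in nodes_by_id[node_id].get("items", []):
        leaves |= gather_leaf_ids(item["id"], nodes_by_id)
    return leaves
```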
Ilya Dryomov [Mon, 7 Oct 2019 13:32:39 +0000 (15:32 +0200)]
krbd: retry on transient errors from udev_enumerate_scan_devices()
udev_enumerate_scan_devices() doesn't handle disappearing devices well.
If called while some devices are being removed, it sometimes propagates
ENOENT and ENODEV errors encountered operating on directory entries in
/sys that no longer exist. Some of these errors are suppressed, but
this isn't reliable and varies across versions. In particular, systemd
239 suppresses ENODEV from sd_device_new_from_syspath() but doesn't
suppress ENODEV from sd_device_get_devnum(). In systemd 243 the call
to sd_device_get_devnum() has been moved, but it still leaks ENOENT
from sd_device_get_is_initialized() (referring to the body of
FOREACH_DIRENT_ALL loop in enumerator_scan_dir_and_add_devices()).
Assume that all ENOENT and ENODEV errors are transient and retry the
call to udev_enumerate_scan_devices(). Don't limit the number of retries,
but log each one.
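As an illustration only (the real code is C++ calling libudev), the retry rule could be sketched like this. `scan` and `log` are hypothetical callables:

```python
import errno

TRANSIENT_ERRNOS = (errno.ENOENT, errno.ENODEV)


def scan_retrying_transient(scan, log=print):
    """Call scan() until it stops failing with ENOENT or ENODEV.

    There is no cap on the number of retries, but each one is logged.
    Any other error is passed on to the caller.
    """
    while True:
        try:
            return scan()
        except OSError as e:
            if e.errno not in TRANSIENT_ERRNOS:
                raise
            log("scan failed with %s, retrying" % errno.errorcode[e.errno])
```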
Sage Weil [Fri, 25 Oct 2019 02:03:34 +0000 (21:03 -0500)]
ceph-daemon: /var/run/ceph -> /var/run/ceph/$fsid
This is better than having a single /var/run/ceph on the host with a
weird naming scheme. Among other things, it means that we can access
the asok for any daemon for a given fsid from any container on the same
host with the same fsid (notably, a shell).
Sage Weil [Thu, 24 Oct 2019 23:56:04 +0000 (18:56 -0500)]
mon: fix tell to hybrid octopus/pre-octopus mons
We can't decide whether to use the new tell command style based on the
monmap.min_mon_release alone because some mons may be octopus even though
that hasn't updated yet. The same goes for if we look at the combined
features for the cluster--the underlying problem is the monmap doesn't
tell us which mons are octopus and which ones aren't, so we don't know
how to behave.
Instead, allow octopus+ mons to advertise the converted tell commands
going forward, for compatibility with pre-octopus clients (who do the old
style of tell) and for octopus+ clients talking to a min_mon_release <
octopus cluster.
Sage Weil [Thu, 24 Oct 2019 13:41:33 +0000 (08:41 -0500)]
ceph-daemon: only set up crash dir mount if it exists
Sometimes we run containers on a host that doesn't have a crash dir set
up (because no daemon has been deployed). Examples include shell and
ceph-volume.
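A minimal sketch of the conditional mount. The directory layout, the container-side path and the function name are assumptions for illustration, not ceph-daemon's actual values:

```python
import os


def get_crash_mounts(fsid, data_dir="/var/lib/ceph"):
    """Return {host_path: container_path} for the crash dir, if present."""
    mounts = {}
    crash_dir = os.path.join(data_dir, fsid, "crash")  # hypothetical layout
    # Only add the bind mount when the host directory exists; otherwise
    # the container (e.g. a shell) starts without it.
    if os.path.exists(crash_dir):
        mounts[crash_dir] = "/var/lib/ceph/crash"  # hypothetical target
    return mounts
```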
David Zafman [Thu, 24 Oct 2019 18:31:52 +0000 (11:31 -0700)]
ceph-objectstore-tool: call collection_bits() crashes on the meta collection
Skip new check for meta collection
test:
Turn off osd_pool_default_pg_autoscale_mode just like bash tests do
Fix test by checking for new error message