Jason Dillaman [Sat, 23 Nov 2019 15:36:31 +0000 (10:36 -0500)]
cls/rbd: sanitize the mirror image status peer address after reading from disk
RADOS upgrade tests were failing when OSDs were partially upgraded since the
entity_addr_t::type overload wasn't being recovered when re-read. Now we will
always sanitize the on-disk entity address after reading it to avoid such
issues of on-disk encoding/decoding.
Fixes: https://tracker.ceph.com/issues/42891
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 1542d12e1b893166a5bc7b7a4c9a4474078a98be)
Conflicts:
src/cls/rbd/cls_rbd(_types).[h|cc]: the MirrorImageStatusOnDisk struct has moved
Patrick Donnelly [Thu, 21 Nov 2019 01:34:26 +0000 (17:34 -0800)]
Merge PR #30521 into nautilus
* refs/pull/30521/head:
qa: have kclient tests use new mount.ceph functionality
doc: document that the kcephfs mount helper will search keyring files for secrets
mount.ceph: fork a child to get info from local configuration
mount.ceph: track mon string and path inside ceph_mount_info
mount.ceph: add name and secret to ceph_mount_info
mount.ceph: add ceph_mount_info structure
mount.ceph: clean up debugging output and error messages
mount.ceph: clean up return codes
mount.ceph: add comment explaining why we need to modprobe
mount.ceph: use bools for flags
common: have read_secret_from_file return negative error codes
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Jos Collin [Mon, 1 Jul 2019 09:02:33 +0000 (14:32 +0530)]
rgw: Silence warning: control reaches end of non-void function
Build shows:
[ 53%] Building CXX object src/mds/CMakeFiles/mds.dir/JournalPointer.cc.o
ceph/src/rgw/rgw_rest_s3.cc: In member function ‘RGWOp* RGWHandler_REST_Bucket_S3::get_obj_op(bool)’:
ceph/src/rgw/rgw_rest_s3.cc:3588:5: warning: control reaches end of non-void function [-Wreturn-type]
} }
^
Fixes: 4ffc765c4c5debc665ade7769c4647c3a7278fd2
Fixes: http://tracker.ceph.com/issues/40747
Signed-off-by: Jos Collin <jcollin@redhat.com>
(cherry picked from commit abb2451dd5164e6b610589207b900a6464e21282)
Sage Weil [Wed, 7 Aug 2019 17:41:33 +0000 (12:41 -0500)]
os/bluestore/BlueFS: apply shared_alloc_size to shared device
Keep an alloc_size vector so that we have this value handy at all times.
Allow bluestore to fetch this value directly instead of looking at the
bluefs_* config options since this encapsulates things a bit better, and
also isn't vulnerable to the config setting changing at runtime.
Sage Weil [Thu, 8 Aug 2019 18:30:59 +0000 (13:30 -0500)]
os/bluestore/BlueFS: fix device_migrate_to_* to handle varying alloc sizes
The previous implementation moved extents individually. This caused
problems when moving an extent with a small alloc_size that wasn't
a multiple of the target device's alloc_size.
Instead, identify files with extents that need to be moved, and then read
the file in its entirety and rewrite it in its entirety.
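The alignment problem described above can be illustrated with a minimal sketch. The sizes and the helper names (`round_up`, `fits_target`) are hypothetical and chosen only for illustration; this is not BlueFS code.

```python
def round_up(length: int, alloc_size: int) -> int:
    """Round length up to the next multiple of alloc_size."""
    return -(-length // alloc_size) * alloc_size


def fits_target(extent_length: int, target_alloc_size: int) -> bool:
    """True if an extent can be moved as-is onto the target device."""
    return extent_length % target_alloc_size == 0


SRC_ALLOC = 4 * 1024   # hypothetical alloc_size of the source device
DST_ALLOC = 64 * 1024  # hypothetical alloc_size of the target device

# A file made of three small extents on the source device.
extents = [SRC_ALLOC, SRC_ALLOC, SRC_ALLOC]

# Moving extents one by one: none is a multiple of the target alloc_size.
per_extent_ok = [fits_target(e, DST_ALLOC) for e in extents]

# Rewriting the whole file: one allocation, rounded up to the target unit.
file_length = sum(extents)
whole_file_allocation = round_up(file_length, DST_ALLOC)
```

Reading the file and rewriting it whole means that only the total length has to be rounded up to the target unit, rather than each extent separately.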
Igor Fedotov [Tue, 16 Jul 2019 14:16:16 +0000 (17:16 +0300)]
os/bluestore: cleanup around allocator calls
Both the stupid and bitmap allocators return -ENOSPC if they're
unable to allocate any space. Existing callers don't always
respect this, hence some cleanup.
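As a rough illustration of the caller-side check this commit is about, the sketch below uses a hypothetical `allocate` function that returns either the number of bytes allocated or a negative errno. The names and the return convention are assumptions made for this example, not the BlueStore allocator interface.

```python
import errno


def allocate(free_bytes: int, want: int) -> int:
    """Hypothetical allocator: bytes allocated, or -ENOSPC if none are free."""
    if free_bytes <= 0:
        return -errno.ENOSPC
    return min(free_bytes, want)


def reserve(free_bytes: int, want: int) -> int:
    """Caller that checks for a negative result before using it as a length."""
    got = allocate(free_bytes, want)
    if got < 0:
        # Propagate the error instead of treating it as an allocated size.
        return got
    return got
```

A caller that skips the `got < 0` check would go on to treat the error code as a length.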
Radu Toader [Wed, 30 Oct 2019 08:42:41 +0000 (10:42 +0200)]
mgr/dashboard: fix grafana dashboards
Fixes: https://tracker.ceph.com/issues/42542
Sort order was wrong for some dashboards.
Fixed empty / buggy Top 3 clients IOPS by pool / Throughput in Pools Overall performance.
Fixed Avg utilization Multiple series found in Host Overall performance.
Fixed invalid dimensions for plot in OSD Overall performance.
The previous bluestore_no_per_pool_stats_tolerance had a lot of possible
values, not all of which make sense to users. Replace with a single new
option, bluestore_fsck_error_on_no_per_pool_stats, which controls whether
the lack of per-pool stats is an error or a warning. On repair, we will
unconditionally convert to per-pool stats.
This brings us in sync with the newer
bluestore_fsck_error_on_no_per_pool_omap.
Note that one part of the ceph_test_objectstore test is dropped since it
is no longer possible to create a store with legacy stats.
luo rixin [Tue, 12 Nov 2019 08:36:53 +0000 (16:36 +0800)]
mon/PGMap: fix incorrect pg_pool_sum when delete pool
We found that the number of pools displayed by "ceph -s" sometimes
differs from "ceph osd lspools" after deleting a pool. The cause is
that Mgr ClusterState::ingest_pgstats receives old pg_stat entries for
PGs that some OSDs had not yet deleted when the pool was deleted, and
adds them to pending_inc.pool_statfs_updates. PGMap::apply_incremental
then unintentionally adds the deleted pool back to pg_pool_sum even
though it has already been removed from the OSDMap. This can also cause
a segmentation fault in the MON.
Fixes: https://tracker.ceph.com/issues/42689
Fixes: https://tracker.ceph.com/issues/42592
Fixes: https://tracker.ceph.com/issues/41228
Fixes: https://tracker.ceph.com/issues/40011
Signed-off-by: luo rixin <luorixin@huawei.com>
(cherry picked from commit 446eca8defddda44fea7c789065afdb1a9d38dae)
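The failure mode described in this commit message can be sketched with plain dictionaries. The function name `apply_pool_updates` and its data shapes are hypothetical, and the guard shown is only one possible way to avoid the problem, not necessarily what the commit does.

```python
def apply_pool_updates(pg_pool_sum: dict, updates: dict, existing_pools: set) -> dict:
    """Apply per-pool stat updates, ignoring pools that no longer exist."""
    for pool_id, stats in updates.items():
        if pool_id not in existing_pools:
            # Stale stats for a deleted pool: drop them rather than
            # re-creating the pool's summary entry.
            pg_pool_sum.pop(pool_id, None)
            continue
        pg_pool_sum[pool_id] = stats
    return pg_pool_sum


# Pool 2 was deleted, but a late stats report still mentions it.
summary = apply_pool_updates(
    pg_pool_sum={1: {"objects": 10}},
    updates={1: {"objects": 12}, 2: {"objects": 5}},
    existing_pools={1},
)
```

Without the membership check, the late report for pool 2 would put it back into the summary, which matches the mismatch between the two pool counts.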
Sage Weil [Fri, 15 Nov 2019 23:16:43 +0000 (17:16 -0600)]
Merge PR #30851 into nautilus
* refs/pull/30851/head:
doc/mgr/crash: document missing commands, options
qa/suites/rados/singleton/all/test-crash: whitelist RECENT_CRASH
qa/suites/rados/mgr/tasks/insights: whitelist RECENT_CRASH
qa/tasks/mgr/test_insights: crash module now rejects bad crash reports
mgr/crash: don't make these methods static
mgr/BaseMgrModule: handle unicode health detail strings
mgr/crash: verify timestamp is valid
qa/suites/mgr: whitelist RECENT_CRASH
mgr/crash: remove unused var
mgr/crash: remove unused import 'six'
qa/workunits/rados/test_crash: health check
mgr/crash: improve validation on post
mgr/crash: automatically prune old crashes after a year
mgr/crash: raise RECENT_CRASH warning for recent (new) crashes
mgr/crash: add 'crash ls-new'
mgr/crash: add option and serve infra
mgr/crash: keep copy of crashes in memory
mgr/pg_autoscaler: adjust style to match built-in tables
mgr/crash: make 'crash ls' a nice table with a NEW column
mgr/crash: nicely format 'crash info' output
mgr/crash: add 'crash archive <id>', 'crash archive-all' commands
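Two entries in the list above, timestamp validation and pruning crashes older than a year, can be sketched together. The timestamp format, the dictionary layout and the function names here are assumptions made for illustration, not the crash module's actual code. This sketch simply skips entries with invalid timestamps.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed format for this sketch
MAX_AGE = timedelta(days=365)


def parse_timestamp(value):
    """Return a datetime, or None if the value is not a valid timestamp."""
    try:
        return datetime.strptime(value, TIMESTAMP_FORMAT)
    except (TypeError, ValueError):
        return None


def prune(crashes: dict, now: datetime) -> dict:
    """Keep only crashes with a valid timestamp no older than MAX_AGE."""
    kept = {}
    for crash_id, meta in crashes.items():
        ts = parse_timestamp(meta.get("timestamp"))
        if ts is not None and now - ts <= MAX_AGE:
            kept[crash_id] = meta
    return kept
```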
Sage Weil [Fri, 15 Nov 2019 23:16:30 +0000 (17:16 -0600)]
Merge PR #30849 into nautilus
* refs/pull/30849/head:
mgr/dashboard: fix mgr module API tests
qa/tasks/mgr/dashboard/test_mgr_module: remove enable/disable test from MgrModuleTelemetryTest
qa/tasks/mgr/dashboard/test_mgr_module: sync w/ telemetry
mgr/dashboard/qa: add more fields to report
Merge branch 'nautilus' into wip-device-telemetry-nautilus
PendingReleaseNotes: fix typo
PendingReleaseNotes: remove kludge
mgr/telemetry: add stats about crush map
mgr/telemetry: add rgw metadata
mgr/telemetry: include fs size (files, bytes, snaps)
mds: report r{files,bytes,snaps} via perfcounters
mgr/telemetry: mds cache stats
mgr/telemetry: add some rbd metadata
mgr/telemetry: note whether osd cluster_network is in use
mgr/telemetry: add host counts
mgr/telemetry: add more pool metadata
mgr/telemetry: remove crush rule name
mgr/telemetry: include min_mon_release and msgr v1 vs v2 addr count
mgr/telemetry: add CephFS metadata
mgr/telemetry: include balancer info (active=true/false, mode)
mgr/telemetry: include per-pool pg_autoscale info
mgr/telemetry: dict.pop() errs on nonexistent key
mgr/telemetry: send device telemetry via per-host POST to device endpoint
mgr/telemetry: fix remote into crash do_ls()
mgr/telemetry: clear the event after being awaken by it
mgr/telemetry: bump content revision and add a release note
telemetry/server: add device report endpoint
mgr/telemetry: include device telemetry
mgr/telemetry: salt osd ids too
mgr/telemetry: obscure entity_name with a salt
mgr/telemetry: force re-opt-in if the report contents change
mgr/telemetry: less noise in the log
mgr/telemetry: wake up serve on config change
mgr/telemetry: track telemetry report revisions
qa/tasks/mgr/dashboard/test_mgr_module: adjust expected schema
mgr/telemetry: separate out cluster config vs running daemons
mgr/telemetry: include any config options that are customized
mgr/telemetry: specify license when opting in
doc/mgr/telemetry: update
mgr/telemetry: move contact info to an 'ident' channel
mgr/telemetry: accept channel list to 'telemetry show'
mgr/telemetry: always generate new report for 'telemetry show'
mgr/telemetry: add 'device' channel and call out to devicehealth module
mgr/telemetry: add telemetry channel 'device'
mgr/telemetry: add separate channels
mgr/devicehealth: pull out MAX_SAMPLES
* use primitive types instead of `JLeaf(the_type)` as they are
equivalent in this context
* remove fields which are added only if certain channels are
activated.
* allow unknown fields, as we are including various stuff
in the report, for instance, osdmap, usage, crash info, etc.
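The "mgr/telemetry: dict.pop() errs on nonexistent key" entry in the PR #30849 list refers to standard Python behaviour: `dict.pop(key)` raises `KeyError` when the key is absent unless a default is supplied. A minimal sketch follows, with hypothetical field names that are not taken from the telemetry module.

```python
report = {"crush_rule": 0, "size": 3}  # hypothetical report fields

# Without a default, popping a missing key raises KeyError.
try:
    report.pop("name")
    raised = False
except KeyError:
    raised = True

# With a default, the call is safe whether or not the key exists.
removed = report.pop("name", None)
```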