__project_pg_history__ traverses, in reverse, the series
of osdmaps passed in to fill a pg's pg_history_t, which
can become extremely inefficient if the list of osdmaps to check is
very long.
E.g., in one of our clusters, we observed that it took approximately
10s for a PG to finish its projection:
```
2018-08-27 13:51:58.694823 7f1e1335a700 15 osd.9 823276 project_pg_history 34.6e9 from 821893 to 823276, start ec=380829/380829 lis/c 820412/820412 les/c/f 820413/820413/0 821785/821785/821785
2018-08-27 13:52:08.634230 7f1e1335a700 15 osd.9 823276 project_pg_history 34.6e9 acting|up changed in 822265 from [57]/[57] 57/57 -> [58,57]/[58,57] 58/58
2018-08-27 13:52:08.634244 7f1e1335a700 15 osd.9 823276 project_pg_history 34.6e9 up changed in 822265 from [57] 57 -> [58,57] 58
2018-08-27 13:52:08.634248 7f1e1335a700 15 osd.9 823276 project_pg_history 34.6e9 primary changed in 822265
2018-08-27 13:52:08.634250 7f1e1335a700 15 osd.9 823276 project_pg_history end ec=380829/380829 lis/c 820412/820412 les/c/f 820413/820413/0 822265/822265/822265
```
Quote from Sage:
> let's just kill this off entirely, and make the handle_pg_query_nopg
> reply unconditionally. Or, maybe, do a single sloppy check to see if
> the primary has changed since the original epoch... if the osdmap
> happens to be in cache... or not.
> The querying end will discard the reply if it is out of date from its
> perspective, so it doesn't matter, and I suspect the overhead of doing
> the check is larger than the overhead of sending a query reply that gets ignored.
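A minimal sketch of that sloppy check, using invented stand-in types (MapCache, OSDMapStub) rather than the real OSD classes: reply unconditionally unless both maps happen to be cached and show the primary has changed.
```
#include <cstdint>
#include <map>

using epoch_t = uint32_t;

// Stand-in for the real OSDMap: only what the check needs.
struct OSDMapStub {
  std::map<int, int> primaries;  // pgid -> primary osd
  int get_pg_primary(int pgid) const {
    auto it = primaries.find(pgid);
    return it == primaries.end() ? -1 : it->second;
  }
};

// Stand-in for the osdmap cache; try_get() never loads from disk.
struct MapCache {
  std::map<epoch_t, OSDMapStub> cache;
  const OSDMapStub* try_get(epoch_t e) const {
    auto it = cache.find(e);
    return it == cache.end() ? nullptr : &it->second;
  }
};

// Decide whether to reply to a pg query without walking every osdmap in
// [query_epoch, cur_epoch] the way the removed projection did.
bool should_reply(const MapCache& mc, int pgid,
                  epoch_t query_epoch, epoch_t cur_epoch) {
  const OSDMapStub* then = mc.try_get(query_epoch);
  const OSDMapStub* now = mc.try_get(cur_epoch);
  if (then && now &&
      then->get_pg_primary(pgid) != now->get_pg_primary(pgid)) {
    // The primary clearly changed; the querying side would discard our
    // reply anyway, so skip it.
    return false;
  }
  // Sloppy by design: if either map is not cached, just reply; a stale
  // reply is discarded by the recipient.
  return true;
}
```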
script: backport-create-issue: optionally take list of issue numbers
Make the script optionally take a comma-separated list of issue numbers.
(Could be just one issue.)
Before this patch, the backport-create-issue script insisted on looping over
all issues in "Pending Backport" status. This was cumbersome when only
one issue (or a couple of issues) needed to be processed.
Sage Weil [Tue, 11 Sep 2018 14:27:51 +0000 (09:27 -0500)]
Merge PR #22987 into master
* refs/pull/22987/head:
common,rgw: rename sha1_digest_t
osd: decrement old chunk's reference count if the chunk has a reference.
src/test: add a unit test
osd: using fingerprint OID if fingerprint is set
osd: add flag interfaces in chunk_info_t
common/buffer.cc: add sha1 fingerprint
osd: add fingerprint property
mon: add a command to set fingerprint algorithm
John Spray [Tue, 24 Jul 2018 22:40:41 +0000 (18:40 -0400)]
mgr/progress: no progress event on unmoved pgs
PGs may not be moved on osd out if there is no suitable
location for them to move to. In this situation
it doesn't make sense to have a progress event, as the
health warnings adequately communicate the situation.
Previously, the event was not getting moved to the completed
list. There are a couple more cases too (see the sketch after this list):
- When some pgs go away (a pool is removed) during the event
- When the OSD comes back in after going out
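A rough sketch of those completion rules with invented types (the real module is Python; ProgressEvent and these helpers are illustrative only):
```
#include <set>

// Invented stand-in for a progress event tracking an "osd out".
struct ProgressEvent {
  std::set<int> tracked_pgs;  // pgs expected to move off the out osd
  bool complete = false;
};

// Rule 1: if nothing actually remapped, do not start an event at all.
bool should_create_event(const std::set<int>& remapped_pgs) {
  return !remapped_pgs.empty();
}

// Rules 2 and 3: finish the event when tracked pgs vanish (e.g. their
// pool was removed) or when the osd comes back in before data moved.
void update_event(ProgressEvent& ev, const std::set<int>& existing_pgs,
                  bool osd_back_in) {
  for (auto it = ev.tracked_pgs.begin(); it != ev.tracked_pgs.end();) {
    if (!existing_pgs.count(*it))
      it = ev.tracked_pgs.erase(it);  // pg no longer exists
    else
      ++it;
  }
  if (ev.tracked_pgs.empty() || osd_back_in)
    ev.complete = true;
}
```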
ceph-volume lvm.batch.bluestore validation and reporting with VG reuse
Reworks the bluestore validation and reporting to account for reusable
VGs from fast devices, and adds validation calls to ensure that the new
calculation approach will work.
Sage Weil [Mon, 10 Sep 2018 12:45:58 +0000 (07:45 -0500)]
Merge PR #23845 into master
* refs/pull/23845/head:
osd/OSDMap: include age in up and in counts for ceph status
mon/OSDMonitor: set new_last_{up,in}_change
osd/OSDMap: store last_up_change and last_in_change
mgr/MgrMap: include mgr age in map printer
mon/MgrMap: track active_changed timestamp
mon: include mon quorum age in status
include/utime: add utimespan_str helper
Since https://github.com/ceph/ceph/pull/23958, an OSD will ping the
monitor periodically if it is stuck at __wait_for_healthy__. But in the
above case the OSDs still consider themselves __active__ and
hence miss that fix.
Since these OSDs may still be able to contact the monitors
(otherwise there would be no way for them to be marked up again) and send
beacons continuously, we can simply get them out of the trap by
sharing some new maps with them.
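A minimal sketch of the idea with invented names (not the real monitor API): a beacon advertising a stale epoch from an up OSD triggers a map share.
```
#include <cstdint>
#include <iostream>

using epoch_t = uint32_t;

struct Beacon { int osd; epoch_t map_epoch; };  // invented message stub

// An OSD that keeps sending beacons is clearly reachable, so if its
// advertised epoch lags behind the latest map we can unstick it by
// sharing the newer maps with it.
void handle_beacon(const Beacon& b, epoch_t latest_epoch) {
  if (b.map_epoch < latest_epoch) {
    // stand-in for sending incremental maps (b.map_epoch, latest_epoch]
    std::cout << "sharing maps (" << b.map_epoch << ", " << latest_epoch
              << "] with osd." << b.osd << "\n";
  }
}
```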
James McClune [Fri, 31 Aug 2018 03:28:24 +0000 (23:28 -0400)]
doc: fixed hit set type link
Fixed reference link for hit set type value. Restructured wording in description. Fixes: https://tracker.ceph.com/issues/34539 Signed-off-by: James McClune <jmcclune@mcclunetechnologies.net>
mon/OSDMonitor: invalidate max_failed_since on cancel_report
max_failed_since might reference the very failure report that is being
cancelled. We can simply invalidate it here so that **get_failed_since()**
recalculates it if necessary (see the sketch below).
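A sketch of that memoization pattern with invented names (FailureTracker is not the real monitor type): cancel_report() clears the cached maximum so get_failed_since() recomputes it lazily.
```
#include <algorithm>
#include <map>
#include <optional>

struct FailureTracker {
  std::map<int, double> reports;           // reporter -> failed_since
  std::optional<double> max_failed_since;  // cached maximum

  double get_failed_since() {
    if (!max_failed_since) {  // recompute lazily after invalidation
      double m = 0;
      for (auto& [reporter, t] : reports)
        m = std::max(m, t);
      max_failed_since = m;
    }
    return *max_failed_since;
  }

  void cancel_report(int reporter) {
    reports.erase(reporter);
    // The cached value may have come from this very report, so drop it.
    max_failed_since.reset();
  }
};
```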
Sage Weil [Sat, 8 Sep 2018 00:34:00 +0000 (19:34 -0500)]
Merge PR #23449 into master
* refs/pull/23449/head:
osd/OSDMap: cleanup: s/tmpmap/nextmap/
qa/standalone/osd/osd-backfill-stats: fixes
osd/OSDMap: clean out pg_temp mappings that exceed pool size
mon/OSDMonitor: clean temps and upmaps in encode_pending, efficiently
osd/OSDMapMapping: do not crash if acting > pool size
Sage Weil [Fri, 31 Aug 2018 15:52:04 +0000 (10:52 -0500)]
qa/standalone/osd/osd-backfill-stats: fixes
Grep from the primary's log, not every osd's log.
For the backfill_remapped task in particular, after the pg_temp change it
just so happens that the primary changes across the pool size change and
thus two different primaries do (some) backfill. Fix that test to pass
the correct primary.
Other tests are unaffected as they do not (happen to) trigger a primary
change and already satisfied the (removed) check that only one OSD does
backfill.
Sage Weil [Mon, 6 Aug 2018 18:12:33 +0000 (13:12 -0500)]
osd/OSDMap: clean out pg_temp mappings that exceed pool size
If the pool size is reduced, we can end up with pg_temp mappings that are
too big. This can trigger bad behavior elsewhere (e.g., OSDMapMapping,
which assumes that acting and up are always <= pool size).
Fixes: http://tracker.ceph.com/issues/26866
Signed-off-by: Sage Weil <sage@redhat.com>
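A rough sketch of the cleanup with simplified, invented types: any pg_temp mapping wider than the pool size is dropped.
```
#include <cstddef>
#include <map>
#include <vector>

// pg_temp maps a pg to an explicit list of osds. If the pool size was
// reduced, some of these lists may now be longer than the pool size
// allows, so remove them rather than let acting/up exceed the size.
void clean_pg_temp(std::map<int, std::vector<int>>& pg_temp,
                   size_t pool_size) {
  for (auto it = pg_temp.begin(); it != pg_temp.end();) {
    if (it->second.size() > pool_size)
      it = pg_temp.erase(it);  // mapping exceeds pool size
    else
      ++it;
  }
}
```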
Sage Weil [Mon, 6 Aug 2018 17:54:55 +0000 (12:54 -0500)]
mon/OSDMonitor: clean temps and upmaps in encode_pending, efficiently
- do not rebuild the next map when we already have it
- do this work in encode_pending, not create_pending, so we fix bad
values before they are published (see the sketch below).
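A hypothetical sketch of the resulting shape (invented types, not the actual monitor code):
```
struct Increment { /* pending changes for the next epoch */ };
struct Map {
  void apply(const Increment&) { /* fold the increment in */ }
};

void encode_pending(const Map& cur, const Increment& pending) {
  Map next = cur;       // build the next map exactly once...
  next.apply(pending);
  // ...and reuse it for every cleanup pass, at encode time, so the
  // fixes land in the epoch before it is published:
  // clean_pg_temp(next); clean_upmaps(next);
}
```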
mon: test if gid exists in pending for prepare_beacon
If it does not, send a null map. Bug introduced by 624efc64323f99b2e843f376879c1080276e036f, which made preprocess_beacon only look
at the current fsmap (correctly). prepare_beacon relied on preprocess_beacon
doing that check on pending.
Running:
```
while sleep 0.5; do bin/ceph mds fail 0; done
```
is sufficient to reproduce this bug. You will see:
```
2018-09-07 15:33:30.350 7fffe36a8700 5 mon.a@0(leader).mds e69 preprocess_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 from mds.0 127.0.0.1:6813/2891525302 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2018-09-07 15:33:30.350 7fffe36a8700 10 mon.a@0(leader).mds e69 preprocess_beacon: GID exists in map: 24412
2018-09-07 15:33:30.350 7fffe36a8700 5 mon.a@0(leader).mds e69 _note_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 noting time
2018-09-07 15:33:30.350 7fffe36a8700 7 mon.a@0(leader).mds e69 prepare_update mdsbeacon(24412/a up:reconnect seq 2 v69) v7
2018-09-07 15:33:30.350 7fffe36a8700 12 mon.a@0(leader).mds e69 prepare_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 from mds.0 127.0.0.1:6813/2891525302
2018-09-07 15:33:30.350 7fffe36a8700 15 mon.a@0(leader).mds e69 prepare_beacon got health from gid 24412 with 0 metrics.
2018-09-07 15:33:30.350 7fffe36a8700 5 mon.a@0(leader).mds e69 mds_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 is not in fsmap (state up:reconnect)
```
in the mon leader log. The last line indicates the problem was safely handled.
Fixes: http://tracker.ceph.com/issues/35848
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
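A minimal sketch of the guard with stand-in types (the real check runs against the pending FSMap inside the monitor):
```
#include <cstdint>
#include <set>

using mds_gid_t = uint64_t;

// Stand-in for the pending FSMap: just enough to test gid membership.
struct PendingFSMap {
  std::set<mds_gid_t> gids;
  bool gid_exists(mds_gid_t g) const { return gids.count(g) > 0; }
};

enum class BeaconReply { Normal, NullMap };

// prepare_beacon must consult the *pending* map, because
// preprocess_beacon now (correctly) only checks the committed one.
BeaconReply prepare_beacon(const PendingFSMap& pending, mds_gid_t gid) {
  if (!pending.gid_exists(gid)) {
    // The gid was removed (e.g. by "ceph mds fail") after
    // preprocess_beacon ran; send a null map instead of crashing.
    return BeaconReply::NullMap;
  }
  return BeaconReply::Normal;
}
```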
Sage Weil [Fri, 7 Sep 2018 20:55:21 +0000 (15:55 -0500)]
Merge PR #20469 into master
* refs/pull/20469/head:
osd/PG: remove warn on delete+merge race
osd: base project_pg_history on is_new_interval
osd: make project_pg_history handle concurrent osdmap publish
osd: handle pg delete vs merge race
osd/PG: do not purge strays in premerge state
doc/rados/operations/placement-groups: a few minor corrections
doc/man/8/ceph: drop enumeration of pg states
doc/dev/placement-groups: drop old 'splitting' reference
osd: wait for laggy pgs without osd_lock in handle_osd_map
osd: drain peering wq in start_boot, not _committed_maps
osd: kick split children
osd: no osd_lock for finish_splits
osd/osd_types: remove is_split assert
ceph-objectstore-tool: prevent import of pg that has since merged
qa/suites: test pg merging
qa/tasks/thrashosds: support merging pgs too
mon/OSDMonitor: mon_inject_pg_merge_bounce_probability
doc/rados/operations/placement-groups: update to describe pg_num reductions too
doc/rados/operations: remove reference to lpgs
osd: implement pg merge
osd/PG: implement merge_from
osdc/Objecter: resend ops on pg merge
osd: collect and record pg_num changes by pool
osd: make load_pgs remove message more accurate
osd/osd_types: pg_t: add is_merge_target()
osd/osd_types: pg_t::is_merge -> is_merge_source
osd/osd_types: adding or substracting invalid stats -> invalid stats
osd/PG: clear_ready_to_merge on_shutdown (or final merge source prep)
osd: debug pending_creates_from_osd cleanup, don't use cbegin
ceph-objectstore-tool: debug intervals update
mgr/ClusterState: discard pg updates for pgs >= pg_num
mon/OSDMonitor: fix long line
mon/OSDMonitor: move pool created check into caller
mon/OSDMonitor: adjust pgp_num_target down along with pg_num_target as needed
mon/OSDMonitor: add mon_osd_max_initial_pgs to cap initial pool pgs
osd/OSDMap: set pg[p]_num_target in build_simple*() methods
mon/PGMap: adjust SMALLER_PGP_NUM warning to use *_target values
mon/OSDMonitor: set CREATING flag for force-create-pg
mon/OSDMonitor: start sending new-style pg_create2 messages
mon/OSDMonitor: set last_force_resend_prenautilus for pg_num_pending changes
osd: ignore pg creates when pool FLAG_CREATING is not set
mgr: do not adjust pg_num until FLAG_CREATING removed from pool
mon/OSDMonitor: add FLAG_CREATING on upgrade if pools still creating
mon/OSDMonitor: prevent FLAG_CREATING from getting set pre-nautilus
mon/OSDMonitor: disallow pg_num changes while CREATING flag is set
mon/OSDMonitor: set POOL_CREATING flag until initial pool pgs are created
osd/osd_types: add pg_pool_t FLAG_POOL_CREATING
osd/osd_types: introduce last_force_resend_prenautilus
osd/PGLog: merge_from helper
osd: no cache agent or snap trimming during premerge
osd: notify mon when pending PGs are ready to merge
mgr: add simple controller to adjust pg[p]_num_actual
mon/OSDMonitor: MOSDPGReadyToMerge to complete a pg_num change
mon/OSDMonitor: allow pg_num to adjusted up or down via pg[p]_num_target
osd/osd_types: make pg merge an interval boundary
osd/osd_types: add pg_t::is_merge() method
osd/osd_types: add pg_num_pending to pg_pool_t
osd: allow multiple threads to block on wait_min_pg_epoch
osd: restructure advance_pg() call mechanism
mon/PGMap: prune merged pgs
mon/PGMap: track pgs by state for each pool
osd/SnapMapper: allow split_bits to decrease (merge)
os/bluestore: fix osr_drain before merge
os/bluestore: allow reuse of osr from existing collection
os/filestore: (re)implement merge
os/filestore: add _merge_collections post-check
os: implement merge_collection
os/ObjectStore: add merge_collection operation to Transaction
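Among the commits above, is_merge_source()/is_merge_target() pair merge sources with their targets when pg_num shrinks. A rough sketch of that pairing, under the simplifying assumption that pg_num is halved between powers of two (the general case goes through ceph_stable_mod and is more involved):
```
#include <cassert>

// When pg_num doubles, pg seed s splits off child s + old_pg_num; a
// merge is the inverse. Assumes power-of-two pg_num values only.
bool is_merge_source(unsigned seed, unsigned old_pg_num,
                     unsigned new_pg_num) {
  return seed >= new_pg_num && seed < old_pg_num;
}

unsigned merge_target(unsigned seed, unsigned new_pg_num) {
  return seed - new_pg_num;  // the seed it split from originally
}

int main() {
  // Halving pg_num 8 -> 4: seeds 4..7 are sources, merging into 0..3.
  assert(is_merge_source(5, 8, 4));
  assert(merge_target(5, 4) == 1);
  return 0;
}
```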