Sage Weil [Tue, 2 Apr 2019 01:52:46 +0000 (20:52 -0500)]
Merge PR #27220 into nautilus
* refs/pull/27220/head:
osd/PG: move '}' to the proper place
doc: Document new pg state and changes to auto repair behavior
osd, test: Add num_shards_repaired to osd_stat_t for pushes with repair set 3(3)
osd: Track num_objects_repaired in pg stats 2(3)
test, osd: Improvements to auto_repair 1(3)
test: osd-scrub-repair.sh: use corrupt_and_repair_lrc for lrc tests
osd: Publish stats after all changes made
osd: Fixes for 64-bit PG state
Sage Weil [Tue, 2 Apr 2019 01:52:00 +0000 (20:52 -0500)]
Merge PR #27278 into nautilus
* refs/pull/27278/head:
log,global: do not start flusher thread until after we have our mon config
log: buffer log entries until flusher thread starts
log: open log file from flusher thread
common/ceph_context: fix log_to_file observer
common: add bool log_to_file option
David Zafman [Wed, 13 Mar 2019 05:22:53 +0000 (22:22 -0700)]
test, osd: Improvements to auto_repair 1(3)
Allow auto_repair for replicated bluestore pools
Regular scrub within auto repair parameters will trigger deep scrub
New state failed_repair if PG repair attempt could not fix everything
Set failed_repair if not possible to repair anything
Fixes: http://tracker.ceph.com/issues/38616
Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit 2202e5d0b107795837ce79ffce2a980e8c12fc62)
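As a usage sketch, the auto-repair behavior described above is driven by existing OSD options; assuming the Nautilus 'ceph config' interface, something like the following enables it (the values here are illustrative):
    ceph config set osd osd_scrub_auto_repair true            # allow scrub to initiate repair on its own
    ceph config set osd osd_scrub_auto_repair_num_errors 5    # but only when at most 5 errors are found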
Sage Weil [Sat, 30 Mar 2019 13:35:23 +0000 (08:35 -0500)]
Merge PR #27139 into nautilus
* refs/pull/27139/head:
os/bluestore: unconditionally cap chunks returned by allocator to 2^31
os/bluestore: start using 64-bit intervals for bitmap allocator
os/bluestore: make bluestore interval base template.
tests/fastbmap_alloc: UT to reproduce 4G allocation bug
os/bluestore: implement dump for bitmap allocator
os/bluestore: be more tolerant to lack of space for bluefs.
Sage Weil [Fri, 22 Mar 2019 15:58:30 +0000 (10:58 -0500)]
log,global: do not start flusher thread until after we have our mon config
We want to respect the settings in the mon config that affect logging before
we enable the flusher thread. That allows us to set (via the monitor) things
like log_to_file=false or log_to_syslog=true.
Sage Weil [Tue, 19 Mar 2019 10:48:00 +0000 (05:48 -0500)]
common: add bool log_to_file option
This allows us to disable and re-enable logging to a file while preserving
the default log_file location. This is analogous to log_to_stderr,
log_to_syslog, log_to_graylog, etc.
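For example, assuming the Nautilus 'ceph config' interface (the option name is from this commit):
    ceph config set global log_to_file false   # stop writing to the log file
    ceph config set global log_to_file true    # resume at the unchanged default log_file path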
Volker Theile [Fri, 22 Mar 2019 12:01:02 +0000 (13:01 +0100)]
mgr: Configure Py root logger for Mgr modules
Add the CPlusPlusHandler to the Python root logger so that the Mgr module itself and all third-party libraries log to the Ceph log. This may help identify problems in the third-party libraries in use.
The changes in this PR were made while trying to fix a failing test. That problem was solved by another PR, but these changes are worth integrating because they help with debugging, and an additional test has been added (check whether the previously active manager is listed as standby).
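A rough Python sketch of the idea (CPlusPlusHandler is named in this commit; ceph_log below is a hypothetical stand-in for the real bridge into the C++ log):
    import logging

    def ceph_log(level, message):
        # Hypothetical stand-in; the real mgr framework forwards
        # records into the Ceph (C++) logging subsystem.
        print(level, message)

    class CPlusPlusHandler(logging.Handler):
        # Forwards every Python log record into the Ceph log.
        def emit(self, record):
            ceph_log(record.levelno, self.format(record))

    # Attaching the handler to the *root* logger, rather than only the
    # module's own logger, is what makes third-party libraries visible
    # in the Ceph log as well.
    logging.getLogger().addHandler(CPlusPlusHandler())
    logging.getLogger().setLevel(logging.DEBUG)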
xie xingguo [Sat, 23 Mar 2019 01:50:27 +0000 (09:50 +0800)]
osd/OSDMap: calc_pg_upmaps - restrict optimization to origin pools only
The current implementation will try to cancel any pg_upmaps that
would otherwise re-map a PG out from an underfull osd. This is
wrong; for example, it could reliably trigger an assertion failure.
huangjun [Wed, 20 Mar 2019 08:44:02 +0000 (16:44 +0800)]
crush: add root_bucket to identify underfull buckets
All underfull buckets under the root_bucket will be taken as targets.
For a crush rule such as:
step take datacenter0
step chooseleaf firstn 2 type host
step emit
step take datacenter1
step chooseleaf firstn 2 type host
step emit
If one host contains an overfull osd but no underfull osd, the
mapping will fall back to underfull buckets elsewhere as targets,
which may not be in the same datacenter; that would break the rule.
Sage Weil [Mon, 25 Mar 2019 18:40:19 +0000 (13:40 -0500)]
common/config: parse --default-$option as a default value
Sometimes it is useful to specify an alternative default value for an
option via the command line such that it has a lower priority than the
mon config database, config file, the rest of the command line, or the
environment.
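For example (the --default-$option form is what this commit parses; the daemon and option shown are illustrative):
    ceph-osd -i 0 --default-log-to-file=false
Here log_to_file merely defaults to false; a value from the mon config database, a config file, the environment, or a plain --log-to-file elsewhere on the command line still takes precedence.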
Sage Weil [Sun, 24 Mar 2019 15:28:42 +0000 (10:28 -0500)]
Merge PR #27119 into nautilus
* refs/pull/27119/head:
crush/CrushWrapper: make update_choose_args less chatty
qa/standalone/crush/crush-choose-args: add weight-set tests
qa/standalone/crush/crush-choose-args: fix test
crush/CrushWrapper: move_item: do not clobber weight-set weights
crush/CrushWrapper: create_or_move: make weight-set update optional
mon/OSDMonitor: apply osd_crush_update_weight_set for reweight, create-or-move
crush/CrushWrapper: insert_item: make weight-set update optional (for leaves only)
crush/CrushWrapper: use adjust_item_weight_in_bucket for subtree reweight
crush/CrushWrapper: fix detach_bucket, remove_item[_under] vs weight-sets
crush/CrushWrapper: add update_weight_sets arg to adjust_item_weight_*
crush/CrushWrapper: refactor adjust_weight_* into per-bucket helper
crush/CrushWrapper: pass cct down into more places
Igor Fedotov [Mon, 11 Mar 2019 16:13:19 +0000 (19:13 +0300)]
os/bluestore: be more tolerant to lack of space for bluefs.
The 'gift' space is merely advisory for allocation; only the part
actually requested by BlueFS is mandatory. Hence, do not fail when
unable to allocate the whole space.
Fixes: https://tracker.ceph.com/issues/38760
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit dbc1a78787baacd7bbc98ff8bbb72e609def2ad6)
Verify we have the expected behavior for creates and moves that
maintain bucket summation, both with and without the
osd_crush_update_weight_set option enabled.
Sage Weil [Thu, 14 Mar 2019 16:29:10 +0000 (11:29 -0500)]
mon/OSDMonitor: apply osd_crush_update_weight_set for reweight, create-or-move
Since CrushWrapper no longer applies this setting at a low level,
where it can't tell what the real intention is, we instead apply
it at the top command level, where we can.
Specifically, we use it to control whether the weight-set weights
are set for the 'osd crush reweight' and 'osd crush create-or-move'
commands.
Note that this (indirectly) affects the way weight-set weights
are initialized for newly created OSDs, since those are added to
the crush map via the 'osd crush create-or-move' command.
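As a usage sketch (the option and commands are named above; the OSD id and CRUSH location are illustrative):
    ceph config set mon osd_crush_update_weight_set false
    ceph osd crush create-or-move osd.7 1.0 host=node1 root=default   # sets the primary weight only; weight-set weights are left alone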
Sage Weil [Thu, 14 Mar 2019 17:40:23 +0000 (12:40 -0500)]
crush/CrushWrapper: insert_item: make weight-set update optional (for leaves only)
If it is a bucket, we should sum the weight-set values to weight
the bucket in the subtrees. It only makes sense to reset the
weight-set weights for leaf items.
Sage Weil [Thu, 14 Mar 2019 16:29:10 +0000 (11:29 -0500)]
crush/CrushWrapper: add update_weight_sets arg to adjust_item_weight_*
- Make it optional whether the weight-set weights are adjusted to
match the weight.
- Fix the adjustment of the parent bucket(s) so that the
summations in weight-sets are correctly maintained. Prior to
this change, if I adjust any weight, all parent buckets'
weight-set weights are reset to the bucket's primary weight.
Sebastian Wagner [Wed, 13 Feb 2019 14:01:25 +0000 (15:01 +0100)]
mgr/orchestrator: Add error handling to interface
Also:
* Small test_orchestrator refactoring
* Improved Docstring in MgrModule.remote
* Added `raise_if_exception` that raises Exceptions
* Added `OrchestratorError` and `OrchestratorValidationError`
* `_orchestrator_wait` no longer raises anything
* `volumes` model also calls `raise_if_exception`
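A rough Python sketch of the error-handling shape (raise_if_exception and the exception classes are named above; the Completion attributes are assumptions for illustration):
    class OrchestratorError(Exception):
        # General orchestrator failure.
        pass

    class OrchestratorValidationError(OrchestratorError):
        # A request failed validation.
        pass

    class Completion(object):
        # Minimal stand-in: a finished completion carries either a
        # result or a captured exception.
        def __init__(self, result=None, exception=None):
            self.result = result
            self.exception = exception

    def raise_if_exception(completion):
        # _orchestrator_wait no longer raises; callers re-raise any
        # captured exception explicitly, as the volumes model does.
        if completion.exception is not None:
            raise completion.exception
        return completion.result

    c = Completion(exception=OrchestratorValidationError("bad placement spec"))
    try:
        raise_if_exception(c)
    except OrchestratorError as e:
        print("orchestrator call failed: %s" % e)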
Jason Dillaman [Wed, 20 Mar 2019 18:40:50 +0000 (14:40 -0400)]
librbd: ignore -EOPNOTSUPP errors when retrieving image group membership
The Luminous release did not support adding images to a group (it only
included the bare-minimum support for creating groups). Commit f76df32666b
incorrectly dropped support for ignoring this possible failure. This
prevents Nautilus-release clients from opening images contained within
a Luminous-release cluster.
Fixes: http://tracker.ceph.com/issues/38834
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Sage Weil [Sat, 16 Mar 2019 20:06:00 +0000 (15:06 -0500)]
mon/OSDMonitor: allow 'osd pool set pgp_num_actual'
Normally we let the mgr control pgp_num_actual for us in a nice, safe, controlled
way. However, it is very conservative, and only makes changes if all PGs are healthy.
There are situations where the user wants to be more aggressive than this.
For example, if you have a pool with many PGs (say, 4096) and set pg_num_target to a
small number like 4, the mgr will adjust pgp_num way down. This can lead to an OSD
hitting max_pgs_per_osd. That prevents the PGs from becoming active+clean,
which in turn prevents the mgr from adjusting pgp_num back up even if the
user sets the target to a larger value.
This patch lets the user directly adjust pgp_num_actual. Note that we still do
not expose access to pg_num_actual, since there are much stricter conditions that
must be true in order to safely make downward adjustments.
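For example, assuming the Nautilus CLI, the stuck state above can now be pushed along by hand (pool name and value illustrative):
    ceph osd pool set <pool-name> pgp_num_actual 128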
The stress-split thrasher already had this off, but the ec variant did
not. We don't support ceph-objectstore-tool exports/imports between major
versions.
Fixes: http://tracker.ceph.com/issues/38294
Signed-off-by: Sage Weil <sage@redhat.com>