xie xingguo [Mon, 5 Jun 2017 03:29:39 +0000 (11:29 +0800)]
os/bluestore: fix false asserts in Cache::trim_all()
These asserts hold if we are shutting down the BlueStore instance.
But the caller can also be something like "ceph daemon out/osd.1.asok flush_store_cache",
which can fire these asserts as a result.
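An illustrative sketch of the idea (hypothetical SimpleCache type, not the
actual BlueStore code): the emptiness asserts are only valid on the shutdown
path, so a flush triggered from the admin socket must not evaluate them.

    #include <cassert>
    #include <list>

    struct SimpleCache {
      std::list<int> onodes;        // stand-in for cached onodes/buffers
      bool shutting_down = false;

      void trim_all() {
        onodes.clear();             // evict everything
        if (shutting_down) {
          assert(onodes.empty());   // the invariant only holds on shutdown
        }
      }
    };

    int main() {
      SimpleCache c;
      c.onodes = {1, 2, 3};
      c.trim_all();                 // e.g. "ceph daemon ... flush_store_cache"
      c.shutting_down = true;
      c.trim_all();                 // shutdown path keeps the assert
    }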
- VLAs are in GCC and Clang, and are there to stay forever,
if only to be compatible with all the software that is already
out there (see the example after this message).
- Theoretical debates about VLAs being hard to implement were
long ago superseded by the actual implementations.
- Before setting this flag it would be required to first start
work on fixing all the fallout/warnings that will arise from
setting -Wvla.
- Allocating large variables/structures on the stack could be asking
for trouble, but the chances that ceph tools are going to be running
on small embedded devices are rather slim.
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
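For reference, a small illustration of the kind of code -Wvla flags (an
example, not code from the tree): the array size is not a compile-time
constant, so the array is a variable-length array, a GCC/Clang extension in
C++.

    #include <cstdio>
    #include <cstring>

    static void fill(unsigned n) {
      char buf[n];                 // -Wvla: variable length array
      std::memset(buf, 'x', n);
      std::printf("filled %u bytes on the stack\n", n);
    }

    int main(int argc, char**) {
      fill(16u * static_cast<unsigned>(argc + 1));  // size only known at run time
    }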
* do not let osd shut itself down by enlarging osd_max_markdown_count and
shortening osd_max_markdown_period
* do not shut down all osds in the last test. if all osds are shut down at
the same time, none of them will get the updated osdmap after noup is
unset. we should leave at least one of them up, so the gossip protocol
can kick in and propagate the news to all osds.
Kefu Chai [Mon, 29 May 2017 06:28:51 +0000 (14:28 +0800)]
cmake: rgw: do not link against boost wholesale
With the new Beast frontend, RGW now has a small Boost dependency [1] which was
being addressed by statically (and unconditionally) linking *all* the Boost
libraries. This patch ensures that only the necessary Boost components are
linked.
We use the target_link_libraries(<target> <item>...) [2] syntax to ensure that the
library dependencies are transitive: i.e. "when this target is linked into
another target then the libraries linked to this target will appear on the link
line for the other target too."
[1] The boost/asio/spawn.hpp header used by rgw_asio_frontend.cc depends on
boost::coroutine/boost::context
This fixes librbd crashes currently observed on master, when
debug is on, because `rbd_image_options_t` is typedef-ed to `void *`
and its operator<< is used when attempting to print out an
address (`void *`) of any object.
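A minimal sketch of the overload-resolution trap (hypothetical names, not the
librbd code): a free operator<< taking the void* typedef is an exact match,
so it beats the standard ostream overload for const void* whenever any raw
address is streamed; librbd's version then dereferenced the pointer and
crashed.

    #include <iostream>

    typedef void* image_options_t;   // stand-in for rbd_image_options_t

    std::ostream& operator<<(std::ostream& os, image_options_t opts) {
      // librbd dereferenced 'opts' as an options struct here, which crashes
      // when 'opts' is really just some unrelated object's address
      return os << "[image options at " << static_cast<const void*>(opts) << "]";
    }

    int main() {
      int x = 42;
      void* addr = &x;
      std::cout << addr << std::endl;  // intends a plain address, picks the overload above
    }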
Kefu Chai [Fri, 2 Jun 2017 04:43:07 +0000 (12:43 +0800)]
mon/OSDMonitor: filter the creating_pgs added from pgmap
the creating_pgs added from pgmap might contain pgs whose containing
pools have been deleted. this is fine for the PGMonitor, as it has the
updated pg mapping which is consistent with itself. but it does not work
with OSDMonitor's creating_pgs, whose pg mapping is calculated by
itself. so we need to filter the pgmap's creating_pgs against the latest
osdmap.get_pools() when adding them to OSDMonitor's creating_pgs.
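A hedged sketch of the filtering step (simplified pg_id type, not the
OSDMonitor code): queued creating pgs whose pool is absent from the latest
osdmap.get_pools() are dropped before they are merged in.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <utility>

    using pg_id = std::pair<int64_t, uint32_t>;   // (pool id, pg seed)

    // creating_pgs maps pg -> creation epoch; existing_pools mirrors osdmap.get_pools()
    void filter_creating_pgs(std::map<pg_id, uint32_t>& creating_pgs,
                             const std::set<int64_t>& existing_pools) {
      for (auto it = creating_pgs.begin(); it != creating_pgs.end(); ) {
        if (existing_pools.count(it->first.first) == 0) {
          it = creating_pgs.erase(it);   // pool was deleted; never create this pg
        } else {
          ++it;
        }
      }
    }

    int main() {
      std::map<pg_id, uint32_t> creating = {{{1, 0}, 10}, {{2, 0}, 11}};
      filter_creating_pgs(creating, {1});   // pool 2 was deleted
      return creating.size() == 1 ? 0 : 1;
    }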
Greg Farnum [Thu, 1 Jun 2017 05:19:02 +0000 (22:19 -0700)]
mon: mgr: remove osd_stats map from PGMapDigest
We use this information only for dumps. Stop dumping per-OSD stats as they're
not needed. In order to maintain pool "fullness" information, calculate
the OSDMap-based rule availability ratios on the monitor and include those
values in the PGMapDigest. Also do it whenever we call dump_pool_stats_full()
on the manager.
Kefu Chai [Mon, 29 May 2017 16:40:25 +0000 (00:40 +0800)]
mgr: reset pending_inc after applying it
we cannot apply pending_inc twice and expect the result to be the same. in
other words, pg_map.apply_incremental(pending_inc) is not an idempotent
operation.
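A toy sketch of the non-idempotence (hypothetical Inc/Stats types, not the
mgr code): applying the same delta twice double-counts it, hence the reset
right after applying.

    #include <cassert>

    struct Inc { long bytes_written_delta = 0; };

    struct Stats {
      long bytes_written = 0;
      void apply_incremental(const Inc& inc) { bytes_written += inc.bytes_written_delta; }
    };

    int main() {
      Stats pg_map;
      Inc pending_inc;
      pending_inc.bytes_written_delta = 100;

      pg_map.apply_incremental(pending_inc);
      pending_inc = Inc();                    // reset after applying, as the fix does
      pg_map.apply_incremental(pending_inc);  // the next round adds nothing stale
      assert(pg_map.bytes_written == 100);
    }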
Sage Weil [Thu, 25 May 2017 16:13:58 +0000 (12:13 -0400)]
osd/OSDMap: more efficient PGMapTemp
Use a flat_map with pointers into a buffer with the actual data. For a
decoded mapping, we have just two allocations (one for flat_map and one
for the encoded buffer).
This can get slow if you make lots of incremental changes after the fact
since flat_map is not efficient for modifications at large sizes. :/
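A rough sketch of the layout (not the actual OSDMap code): one contiguous
buffer owns the decoded bytes and a boost::container::flat_map of string_view
values points into it, so a decoded mapping costs roughly two allocations.

    #include <boost/container/flat_map.hpp>
    #include <cstdint>
    #include <string>
    #include <string_view>
    #include <utility>

    struct DecodedMapping {
      std::string buffer;                                        // owns the bytes
      boost::container::flat_map<uint64_t, std::string_view> m;  // views into buffer

      void decode(std::string encoded) {
        buffer = std::move(encoded);
        m.clear();
        m.reserve(2);
        // pretend the buffer holds two fixed-width 4-byte records
        m.emplace(1, std::string_view(buffer.data(), 4));
        m.emplace(2, std::string_view(buffer.data() + 4, 4));
      }
    };

    int main() {
      DecodedMapping d;
      d.decode("aaaabbbb");
      return d.m.at(2) == "bbbb" ? 0 : 1;
    }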
Sage Weil [Tue, 23 May 2017 20:35:30 +0000 (16:35 -0400)]
mon/PGMap: count 'unknown' pgs
Also, count "not active" (inactive) pgs instead of active so that we
list "bad" things consistently, and so that 'inactive' is a separate
bucket of pgs from the 'unknown' ones.
Sage Weil [Tue, 23 May 2017 19:17:34 +0000 (15:17 -0400)]
mgr/MgrStandby: reset subscriptions when we become non-active
This is a goofy workaround that we're also doing in Mgr::init(). Someday
we should come up with a more elegant solution. In the meantime, this
works just fine!
Sage Weil [Tue, 23 May 2017 16:02:02 +0000 (12:02 -0400)]
mgr/ClusterState: make pg stat filtering less fragile
We want to drop updates for pgs in pools that don't exist. Keep an
updated set of the existing pools instead of relying on the previous PGMap
having them instantiated. (The previous map may drift due to bugs.)
Sage Weil [Tue, 23 May 2017 13:49:16 +0000 (09:49 -0400)]
mgr: apply PGMap incremental at same interval as reports
We were doing an incremental per osd stat report; this screws up the
delta stats updates when there are more than a handful of OSDs. Instead,
do it with the same period as the mgr->mon reports.
Sage Weil [Tue, 23 May 2017 03:39:09 +0000 (23:39 -0400)]
mon/OSDMonitor: use newest creation epoch for pgs that we can
If we have a huge pool it may take a while for the PGs to get out of the
queue and be created. If we use the epoch in which the pool was created, the
OSD may have to process a lot of old OSDMaps. If we use the current
epoch (the first epoch in which any OSD learned that this PG should
exist) we limit PastIntervals as much as possible.
It is still possible that we start trying to create a PG but the cluster
is unhealthy for a long time, resulting in a long PastIntervals that
needs to be generated by a primary OSD when it eventually comes up. So
this only partially addresses the problem.
Partially-fixes: http://tracker.ceph.com/issues/20050
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Sat, 20 May 2017 21:54:09 +0000 (17:54 -0400)]
mon: speed up pg creates a bit
I don't see any noticeable load on the bigbang cluster, so let's bump this up
a bit. Not being super aggressive here, though, since pool creation is so
rare and who really cares if ginormous clusters take a few minutes to
create all the PGs; better to make sure the mon is happy and responsive
during setup.
Sage Weil [Sat, 20 May 2017 21:25:11 +0000 (17:25 -0400)]
mon/PGMap: update osd_epoch in synchrony with osd_stat_updates
I'm not sure why this didn't bite us earlier, but there is an assert
in apply_incremental (not used in preluminous mon) and an implicit
dereference in PGMonitor::encode_pending (maybe didn't cause crash?)
that will trigger if we have an osd_stat_updates record without a matching
osd_epochs update. Maybe there is some subtle reason why the osd_epochs
update happens elsewhere in master (it doesn't on the mgr), but my guess
is we were silently dereferencing the invalid iterator and not noticing.
Anyway, it's easy to fix. We use the epoch from the previous PGMap.
Sage Weil [Fri, 19 May 2017 21:56:11 +0000 (17:56 -0400)]
mgr: simplify handling of new pgs/pools
Instantiate barebones pg records (creating+stale) in our PGMap when pgs
are created. These will switch to 'creating' when the pg is actually in the
process of being created, and then peering etc. The 'stale' is an indicator
that the mon may not have even asked for the pg to be created yet.
All of the old meticulous tracking in PGMap for mappings for creating
pgs is useless to us; OSDMonitor has new code to handle it. This is
fast and simple.
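A simplified sketch of the barebones record (hypothetical state flags, not
the real pg_stat_t): a new pg gets an entry that is immediately marked
creating+stale, and later stat reports from the OSD overwrite that state.

    #include <cstdint>
    #include <map>

    enum : uint32_t { PG_CREATING = 1u << 0, PG_STALE = 1u << 1 };  // illustrative flags

    struct pg_stat { uint32_t state = 0; };

    void note_new_pg(std::map<uint64_t, pg_stat>& pg_stats, uint64_t pgid) {
      pg_stats[pgid].state = PG_CREATING | PG_STALE;   // barebones record
    }

    int main() {
      std::map<uint64_t, pg_stat> pg_stats;
      note_new_pg(pg_stats, 42);
      return pg_stats[42].state == (PG_CREATING | PG_STALE) ? 0 : 1;
    }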
Sage Weil [Fri, 19 May 2017 21:01:22 +0000 (17:01 -0400)]
mon/PGMap: new check_osd_map that takes a const OSDMap&
The previous version takes an Incremental and requires that we see
every single consecutive map in the history. This version is mgr-friendly
and just takes the latest OSDMap. It's a bit simpler too because it
ignores the full/nearfull (legacy preluminous) and last_osd_report.
Sage Weil [Fri, 19 May 2017 15:48:15 +0000 (11:48 -0400)]
mon/PGMap: cap health detail messages at 50 (configurable)
There are two cases where we spew health detail warnings for potentially
every pg. Cap those detail messages at 50 and, if we exceed that, include
a message saying how many more there are. This avoids huge lists of
detail messages going from the mgr to mon and also makes life better for
users of the health detail api.
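A hedged sketch of the capping (hypothetical helper, not the PGMap health
code): at most 'max' per-pg detail lines are emitted, plus one summary line
for whatever was cut off.

    #include <cstdio>
    #include <string>
    #include <vector>

    void emit_details(const std::vector<std::string>& details, size_t max = 50) {
      size_t shown = 0;
      for (const auto& d : details) {
        if (shown == max) {
          std::printf("%zu more pgs are also affected\n", details.size() - max);
          break;
        }
        std::printf("%s\n", d.c_str());
        ++shown;
      }
    }

    int main() {
      std::vector<std::string> details(120, "pg 1.0 is stuck inactive");
      emit_details(details);   // prints 50 lines plus "70 more pgs are also affected"
    }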
Sage Weil [Fri, 19 May 2017 15:08:26 +0000 (11:08 -0400)]
mon/MgrStatMonitor: trim mgrstat states
We don't actually need any of these older states at all so I hard-coded
a constant (oh no!). In reality it doesn't matter what it is anyway
since PaxosService waits for paxos_service_trim_min (=250) to accumulate
before removing anything.
Kefu Chai [Wed, 17 May 2017 07:30:28 +0000 (15:30 +0800)]
mgr: add a command "mgr report"
* extract send_report() out of tick() so it can be reused.
* add a command "mgr report-mon" for mgr, so we are able to flush the
mgr stats to the mon actively without waiting for the tick. this
could help with the tests.
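A sketch of the refactor shape (hypothetical class, not the mgr code): the
periodic tick and the new command handler both funnel into the same
send_report() helper.

    struct DaemonServerSketch {
      int reports_sent = 0;

      void send_report() {                             // extracted so it can be reused
        ++reports_sent;                                // stand-in for encoding and sending stats to the mon
      }

      void tick() { send_report(); }                   // periodic path
      void handle_report_command() { send_report(); }  // "mgr report-mon" path
    };

    int main() {
      DaemonServerSketch s;
      s.tick();
      s.handle_report_command();  // tests can flush stats without waiting for the tick
      return s.reports_sent == 2 ? 0 : 1;
    }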
Sage Weil [Thu, 18 May 2017 22:16:55 +0000 (18:16 -0400)]
qa/tasks: use new reliable flush_pg_stats helper
The helper gets a sequence number from the osd (or osds), and then
polls the mon until that seq is reflected there.
This is overkill in some cases, since many tests only require that the
stats be reflected on the mgr (not the mon), but waiting for it to also
reach the mon is sufficient!
Sage Weil [Thu, 18 May 2017 21:19:08 +0000 (17:19 -0400)]
osd: report a seq from flush_pg_stats command
Report a sequence number when we flush_pg_stats. Combine the up_from and
a per-boot seq number to get a monotonically increasing value across OSD
restarts (we assume less than 4 billion stats reports in a single epoch).
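A worked sketch of that construction (hypothetical helper name): the boot
epoch goes in the high 32 bits and the per-boot counter in the low 32 bits,
so the value stays monotonic across restarts as long as one epoch sees fewer
than 2^32 flushes.

    #include <cassert>
    #include <cstdint>

    uint64_t make_flush_seq(uint32_t up_from_epoch, uint32_t per_boot_seq) {
      return (static_cast<uint64_t>(up_from_epoch) << 32) | per_boot_seq;
    }

    int main() {
      uint64_t before_restart = make_flush_seq(100, 7);
      uint64_t after_restart  = make_flush_seq(105, 1);  // counter resets on boot
      assert(after_restart > before_restart);            // still increases
    }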
Sage Weil [Thu, 18 May 2017 20:05:28 +0000 (16:05 -0400)]
mon/PGMonitor: clear PGMap data when require_luminous is set
Once the OSDMap flag is set there is no going back. Zero out the on-disk
PGMap data, and clear the in-memory PGMap to free up memory and make
bugs easier to spot.
Sage Weil [Thu, 18 May 2017 19:41:39 +0000 (15:41 -0400)]
mon/OSDMonitor: limit number of concurrently creating pgs
There is overhead for PGs we are creating because the mon has to track
which OSD each one currently maps to. This can be problematic on a very
large cluster. Limit the overhead by setting a cap on the number of PGs
we are creating at once; leave the rest in a persistent queue.
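A hedged sketch of the cap (simplified types, not the OSDMonitor code): the
full backlog stays in a persistent queue and only up to max_creating pgs are
promoted into the set whose mappings the mon actively tracks.

    #include <cstdint>
    #include <deque>
    #include <set>

    void promote_creating_pgs(std::deque<uint64_t>& pending_queue,  // persisted backlog
                              std::set<uint64_t>& creating,         // tracked per osd
                              size_t max_creating) {
      while (creating.size() < max_creating && !pending_queue.empty()) {
        creating.insert(pending_queue.front());
        pending_queue.pop_front();
      }
    }

    int main() {
      std::deque<uint64_t> queue;
      for (uint64_t pg = 0; pg < 10000; ++pg) queue.push_back(pg);
      std::set<uint64_t> creating;
      promote_creating_pgs(queue, creating, 1024);   // cap in-flight creates
      return creating.size() == 1024 ? 0 : 1;
    }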