TrackedOp: Removed redundant lock in OpTracker::_mark_event()
ops_in_flight_lock appears to be redundant in OpTracker::_mark_event(),
and this lock is highly contended. Removing it gives a significant
performance boost.
Somnath Roy [Mon, 18 Aug 2014 23:59:36 +0000 (16:59 -0700)]
CollectionIndex: Collection name is added to the access_lock name
The CollectionIndex constructor is changed to accept the coll_t
so that the collection name can be used to form access_lock(RWLock)
name. This is needed because otherwise lockdep reports a recursive lock
error and asserts. lockdep needs unique lock names for each Index object.
Fixes: #9145
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
Loic Dachary [Mon, 18 Aug 2014 23:30:15 +0000 (01:30 +0200)]
erasure-code: preload the jerasure plugin
Load the jerasure plugin when ceph-osd starts to avoid the following
scenario:
* ceph-osd-v1 is running but did not load jerasure
* ceph-osd-v2 is being installed but takes time: the files
are installed before ceph-osd is restarted
* ceph-osd-v1 is required to handle an erasure coded placement group and
loads jerasure (the v2 version which is not API compatible)
* ceph-osd-v1 calls the v2 jerasure plugin and does not reference the
expected part of the code and crashes
Although this problem shows in the context of teuthology, it is unlikely
to happen on a real cluster because it involves upgrading immediately
after installing and running an OSD. Once it is backported to firefly,
it will not even happen in teuthology tests because the upgrade from
firefly to master will use the firefly version including this fix.
While it would be possible to walk the plugin directory and preload
whatever it contains, that would not work for plugins such as jerasure
that load other plugins depending on the CPU features, or even plugins
such as isa which only work on specific CPU.
Sage Weil [Tue, 12 Aug 2014 03:54:38 +0000 (20:54 -0700)]
mon/OSDMonitor: respect CRUSH weights for reweight-by-pg
Do not assume that all OSDs are weighted equally for reweight-by-pg.
Note that reweight-by-utilization already reweights based on the size of
the OSD volume; we presume that this is already reflected by the CRUSH
weights.
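A minimal sketch of the idea, not the OSDMonitor code: if OSDs are not weighted equally, the PG count each OSD "should" hold is its share of the total CRUSH weight. The function name and the dict-based inputs are assumptions made for illustration.

```python
def expected_pgs(pgs_per_osd, crush_weight):
    """Return the PG count each OSD would hold if PGs followed CRUSH weights.

    pgs_per_osd:  {osd_id: observed PG count}   (hypothetical input shape)
    crush_weight: {osd_id: CRUSH weight}        (hypothetical input shape)
    """
    total_pgs = sum(pgs_per_osd.values())
    total_weight = sum(crush_weight[osd] for osd in pgs_per_osd)
    # Each OSD's target is proportional to its weight, not a flat average.
    return {
        osd: total_pgs * crush_weight[osd] / total_weight
        for osd in pgs_per_osd
    }


pgs = {0: 30, 1: 30, 2: 60}
weights = {0: 1.0, 1: 1.0, 2: 2.0}
targets = expected_pgs(pgs, weights)
```

With a flat average, osd.2 would look overloaded (60 against 40); measured against its CRUSH weight it is exactly on target.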
Sage Weil [Wed, 6 Aug 2014 15:51:18 +0000 (08:51 -0700)]
mon/OSDMonitor: reweight-by-pg for pool(s)
Allow the reweight-by-pg to look at a specific set of pools. If the list
is omitted, use PGs from all pools. This allows you to focus on a
specific pool (the one that will dominate data usage). Otherwise things
may not be quite right because other pools may have PGs that contain
much less data.
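The pool filter can be sketched as follows. This is an illustration only: the mapping keyed by (pool id, PG seed) and the function name are assumptions, not the monitor's data structures.

```python
from collections import Counter


def count_pgs_per_osd(pg_to_osds, pools=None):
    """Count PGs per OSD, optionally looking only at the given pool ids.

    pg_to_osds: {(pool_id, pg_seed): [osd_id, ...]}  (hypothetical shape)
    pools:      set of pool ids, or None/empty to use PGs from all pools
    """
    counts = Counter()
    for (pool_id, _pg_seed), osds in pg_to_osds.items():
        if pools and pool_id not in pools:
            continue  # skip PGs of pools we were not asked about
        counts.update(osds)
    return counts


mapping = {(1, 0): [0, 1], (1, 1): [1, 2], (2, 0): [0, 2]}
all_pools = count_pgs_per_osd(mapping)
only_pool_1 = count_pgs_per_osd(mapping, pools={1})
```

Here every OSD holds two PGs overall, but restricting the count to pool 1 shows that osd.1 carries twice as many of that pool's PGs as the others.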
Sage Weil [Wed, 6 Aug 2014 15:35:07 +0000 (08:35 -0700)]
mon/OSDMonitor: adjust weights up, when possible
Note when OSDs are underloaded, as well. If that is the case, adjust the
OSD reweight value up, if possible. (It won't always be possible since
weights are capped at 1.)
Note that we set the underload threshold to the average, as we want to
aggressively adjust weights up (back to 1.0) whenever possible. This gets
us a more efficient mapping calculation and reduces the amount of "noise"
in the weights.
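A rough sketch of the upward adjustment, assuming the reweight is scaled by the ratio of average load to observed load and then capped at 1.0. The scaling formula and the function name are assumptions for illustration; only the cap at 1 and the use of the average as the underload threshold come from the description above.

```python
def adjust_up(reweight, load, average):
    """Raise the reweight of an underloaded OSD, never above 1.0.

    An OSD counts as underloaded when its load is below the average.
    The proportional scaling used here is an assumed formula.
    """
    if load <= 0 or load >= average:
        return reweight  # nothing to scale, or not underloaded
    return min(1.0, reweight * average / load)


capped = adjust_up(0.8, load=50, average=100)    # would be 1.6, capped
partial = adjust_up(0.5, load=80, average=100)   # scaled up, below the cap
untouched = adjust_up(1.0, load=120, average=100)
```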
Sage Weil [Mon, 4 Aug 2014 22:40:35 +0000 (15:40 -0700)]
mon/OSDMonitor: reweight-by-pg
This is just like reweight-by-utilization, but looks purely at the PG to
OSD mapping, not at the number of bytes used on the target disks. This
allows the reweighting to be done before any data is written into the
cluster, when no data will need to migrate as a result of the reweight.
Boris Ranto [Fri, 15 Aug 2014 17:34:27 +0000 (19:34 +0200)]
Fix -Wno-format and -Werror=format-security options clash
This causes a build failure in the latest Fedora builds.
ceph_test_librbd_fsx adds the -Wno-format cflag, but the default
AM_CFLAGS already contain -Werror=format-security. Previous releases
tolerated this combination, but the latest Fedora rawhide no longer
does. ceph_test_librbd_fsx builds fine without -Wno-format on x86_64,
so there is likely no need for the flag anymore.
Signed-off-by: Boris Ranto <branto@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Fri, 15 Aug 2014 15:55:10 +0000 (08:55 -0700)]
osd: only require crush features for rules that are actually used
Often there will be a CRUSH rule present for erasure coding that uses the
new CRUSH steps or indep mode. If these rules are not referenced by any
pool, we do not need clients to support the mapping behavior. This is true
because the encoding has not changed; only the expected CRUSH output.
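The rule can be sketched like this: collect feature requirements only from CRUSH rules that some pool actually references. The feature label "crush_v2", the pool names, and the dict shapes are placeholders for illustration, not the OSD map API.

```python
def required_features(rule_features, pool_rules):
    """Union of the features needed by rules that at least one pool uses.

    rule_features: {rule_id: set of feature names}  (hypothetical shape)
    pool_rules:    {pool_name: rule_id}             (hypothetical shape)
    """
    used = set(pool_rules.values())
    required = set()
    for rule_id, features in rule_features.items():
        if rule_id in used:
            required |= features
    return required


rules = {0: set(), 1: {"crush_v2"}}  # rule 1 uses the newer CRUSH behavior
unused = required_features(rules, {"rbd": 0})
used = required_features(rules, {"rbd": 0, "ecpool": 1})
```

As long as no pool points at rule 1, clients are not asked to support anything new; the requirement appears only once a pool references it.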
Fixes: #8963
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Wed, 13 Aug 2014 23:17:02 +0000 (16:17 -0700)]
mon/Paxos: share state and verify contiguity early in collect phase
We verify peons are contiguous and share new paxos states to catch peons
up at the end of the round. Do this each time we (potentially) get new
states via a collect message. This will allow peons to be pulled forward
and remain contiguous when they otherwise would not have been able to.
For example, suppose our last committed version is 30, mon.1 has 15..25,
and mon.2 has 28..40.
If we got mon.1 first and then mon.2 second, we would store the new txns
and then boot mon.1 out at the end because 15..25 is not contiguous with
28..40. However, with this change, we share 26..30 to mon.1 when we get
the collect, and then 31..40 when we get mon.2's collect, pulling them
both into the final quorum.
It also breaks the 'catch-up' work into smaller pieces, which ought to
smooth out latency a bit.
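The sharing behavior in the example can be simulated with a small sketch. It assumes the leader's last committed version is 30 (inferred from the 26..30 range shared with mon.1) and models only version numbers; the function and variable names are invented for illustration and are not the Paxos class.

```python
def handle_collects(leader_last, collects):
    """Process collect replies in order; return the ranges shared to peons.

    leader_last: the leader's last committed version (assumed 30 below)
    collects:    [(peon_name, (first_committed, last_committed)), ...]
    Returns a list of (peon_name, from_version, to_version) tuples.
    """
    peon_last = {}
    shared = []
    for name, (_first, last) in collects:
        peon_last[name] = last
        # Assimilate any newer states the peon handed us.
        leader_last = max(leader_last, last)
        # Share state right away with every peon seen so far.
        for peon, lc in peon_last.items():
            if lc < leader_last:
                shared.append((peon, lc + 1, leader_last))
                peon_last[peon] = leader_last
    return shared


shared = handle_collects(30, [("mon.1", (15, 25)), ("mon.2", (28, 40))])
```

mon.1 is caught up in two smaller steps (26..30, then 31..40) instead of being dropped at the end of the round.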
Sage Weil [Thu, 14 Aug 2014 23:55:58 +0000 (16:55 -0700)]
mon/Paxos: verify all new peons are still contiguous at end of round
During the collect phase we verify that each peon has overlapping or
contiguous versions as us (and can therefore be caught up with some
series of transactions). However, we *also* assimilate any new states we
get from those peers, and that may move our own first_committed forward
in time. This means that an early responder might have originally been
contiguous, but a later one moved us forward, and when the round finished
they were not contiguous any more. This leads to a crash on the peon
when they get our first begin message.
For example:
- we have 10..20
- first peon has 5..15
- ok!
- second peon has 18..30
- we apply this state
- we are now 18..30
- we finish the round
- send commit to first peon (empty.. we aren't contiguous)
- send no commit to second peon (we match)
- we send a begin for state 31
- first peon crashes (its lc is still 15)
Prevent this by checking at the end of the round if we are still
contiguous. If not, bootstrap. This is similar to the check we do above,
but reversed, to make sure *we* aren't too far ahead of *them*.
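A minimal sketch of the end-of-round check, using the numbers from the example. It assumes a peon can be caught up only if we still hold the version right after its last committed one; the function name and the exact comparison are illustrative assumptions, not the Paxos code.

```python
def peons_out_of_range(our_first, peon_last_committed):
    """Names of peons we can no longer catch up.

    our_first:           our first_committed after assimilating new states
    peon_last_committed: {peon_name: last_committed}
    A peon is out of range when the next version it needs (lc + 1) is
    older than the oldest version we still have.
    """
    return [
        name
        for name, lc in peon_last_committed.items()
        if lc + 1 < our_first
    ]


peons = {"first": 15, "second": 30}
before = peons_out_of_range(10, peons)  # we still hold 10..20
after = peons_out_of_range(18, peons)   # we moved to 18..30
```

A non-empty result at the end of the round is the case in which the leader should bootstrap rather than send a begin.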
Fixes: #9053
Signed-off-by: Sage Weil <sage@redhat.com>
Loic Dachary [Tue, 3 Jun 2014 20:20:29 +0000 (22:20 +0200)]
erasure-code: parse function for the mapping parameter
Each D letter is a data chunk. For instance:
_DDD_DDD
is going to parse into:
[ 1, 2, 3, 5, 6, 7 ]
The 0 and 4 positions are not used by chunks and do not show in the
mapping. Implement ErasureCode::parse to support a reasonable default
for the mapping parameter.
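The parsing rule described above can be sketched in a few lines. This illustrates the behavior only; it is not the ErasureCode::parse implementation, and the function name is made up.

```python
def parse_mapping(mapping):
    """Return the positions of the data chunks (the 'D' letters)."""
    return [index for index, letter in enumerate(mapping) if letter == "D"]


positions = parse_mapping("_DDD_DDD")
```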
Add support for erasure code plugins that do not sequentially map the
chunks encoded to the corresponding index. This is mostly transparent to
the caller, except when it comes to retrieving the data chunks when
reading. For this purpose there needs to be a remapping function so the
caller has a way to figure out which chunks actually contain the data
and reorder them.
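From the caller's side, the remapping might look like the sketch below: given the chunks indexed by their position and the list of data positions, pick out the data chunks in order. The names, the dict of chunks, and the sample payloads are assumptions for illustration.

```python
def data_in_order(chunks, data_positions):
    """Return the data chunks in logical order.

    chunks:         {position: bytes} as read back    (hypothetical shape)
    data_positions: positions that hold data, in order, e.g. the result
                    of parsing the mapping "_DDD_DDD"
    """
    return [chunks[position] for position in data_positions]


stored = {
    0: b"p0", 1: b"a", 2: b"b", 3: b"c",
    4: b"p1", 5: b"d", 6: b"e", 7: b"f",
}
payload = b"".join(data_in_order(stored, [1, 2, 3, 5, 6, 7]))
```

Positions 0 and 4 hold non-data chunks in this example, so they are skipped when the original data is reassembled.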