git.apps.os.sepia.ceph.com Git

rgw: fix crash when missing content-type in POST object

Fixes: #3941
This fixes a crash when handling S3 POST request and content type
is not provided.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit f41010c44b3a4489525d25cd35084a168dc5f537)

ReplicatedPG: make_snap_collection when moving snap link in snap_trimmer

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 88956e3186798058a1170803f8abfc0f3cf77a07)

ReplicatedPG: correctly handle new snap collections on replica

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9e44fca13bf1ba39dbcad29111b29f46c49d59f7)

mon: Elector: reset the acked leader when the election finishes and we lost

Failure to do so will mean that we will always ack the same leader during
an election started by another monitor. This had been working so far
because we were still acking the existing leader if he was supposed to
still be the leader; or we were acking a new potential leader; or we
would eventually fall behind on an election and start a new election
ourselves, thus resetting the previously acked leader. While this wasn't
something that mattered much until now, the timechecks code stumbled into
this tiny issue and was failing hard at completing a round because there
wouldn't be a reset before the election started -- timechecks are bound
to election epochs.

Fixes: #3854
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c54781618569680898e77e151dd7364f22ac4aa1)

rbd: fix bench-write infinite loop

I/O was continuously submitted as long as there were few enough ops in
flight. If the number of 'threads' was high, or caching was turned on,
there would never be that many ops in flight, so the loop would continue
indefinitely. Instead, submit at most io_threads ops per offset.

Fixes: #3413
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit d81ac8418f9e6bbc9adcc69b2e7cb98dd4db6abb)
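
A minimal sketch of the bounded submission loop described in the bench-write
fix above, written in Python for brevity rather than the C++ of the rbd tool.
The function name and its parameters are illustrative assumptions, not the
actual rbd code. The only point it makes is that the inner loop is bounded by
a counter, so it terminates even when ops complete instantly and the in-flight
count never grows.

```python
def bench_write_batches(io_bytes, io_size, io_threads):
    """Return the write offsets grouped into batches of at most io_threads."""
    batches = []
    off = 0
    while off < io_bytes:
        batch = []
        # Bounded by a counter, not by the number of ops in flight, so the
        # loop ends even if every op completes immediately (e.g. with caching).
        while len(batch) < io_threads and off < io_bytes:
            batch.append(off)  # stand-in for submitting one write at `off`
            off += io_size
        batches.append(batch)
    return batches
```

The sketch assumes io_size is positive; a real tool would validate that
before entering the loop.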

rbd: Don't call ProgressContext's finish() if there's an error.

do_copy was different from the others; call pc.fail() on error and
do not call pc.finish().

Fixes: #3729
Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 0978dc4963fe441fb67afecb074bc7b01798d59d)

librbd: establish watch before reading header

This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit c4370ff03f8ab655a009cfd9ba3a0827d8c58b11)

Revert "librbd: ensure header is up to date after initial read"

Using assert version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e1776809031c6dad441cfb2b9fac9612720b9083.

Conflicts:

qa/workunits/rbd/watch_correct_version.sh
(cherry picked from commit e0858fa89903cf4055889c405f17515504e917a0)

os/FileStore: only adjust up op queue for btrfs

We only need to adjust up the op queue limits during commit for btrfs,
because the snapshot initiation (async create) is currently
high-latency and the op queue is quiesced during that period.

This lets us revert 44dca5c, which disabled the extra allowance because
it is generally bad for non-btrfs writeahead mode.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 38871e27eca5a34de78db23aa3663f6cb045d461)

common/HeartbeatMap: fix uninitialized variable

Introduced by me in 132045ce085e8584a3e177af552ee7a5205b13d8. Thank you,
valgrind!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 00cfe1d3af286ffab7660933415684f18449720c)

sharedptr_registry: remove extraneous Mutex::Locker declaration

For some reason, the lookup() retry loop (for when we happened to
race with a removal and grabbed an invalid WeakPtr) locked
the lock again. This causes the #3836 crash since the lock
is already locked. It's rare since it requires a lookup between
invalidation of the WeakPtr and removal of the WeakPtr entry.

Fixes: #3836
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 037900dc7a051ce2293a4ef9d0e71911b29ec159)

FileStore: ping TPHandle after each operation in _do_transactions

Each completed operation in the transaction proves thread
liveness; a stuck thread should still trigger the timeouts.

Fixes: #3928
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 0c1cc687b6a40d3c6a26671f0652e1b51c3fd1af)
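
The idea in the commit above can be modelled in a few lines. This is a
hedged Python sketch, not the FileStore or TPHandle code: ThreadHeartbeat,
reset_timeout, is_healthy and do_transactions are hypothetical names chosen
for illustration, and the clock is injected so the behaviour can be checked
without real sleeps.

```python
class ThreadHeartbeat:
    """Toy heartbeat entry: unhealthy once the deadline has passed."""

    def __init__(self, grace, clock):
        self.grace = grace
        self.clock = clock
        self.deadline = clock() + grace

    def reset_timeout(self):
        # "Ping": push the deadline out by one grace period from now.
        self.deadline = self.clock() + self.grace

    def is_healthy(self):
        return self.clock() <= self.deadline


def do_transactions(ops, heartbeat):
    """Run each op, pinging the heartbeat after every completed op."""
    for op in ops:
        op()
        heartbeat.reset_timeout()
```

With a grace of 15 time units, three ops of 10 units each never trip the
timeout because each completion resets the deadline, while a single op that
takes 20 units is still seen as unhealthy while it runs.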

OSD: use TPHandle in peering_wq

Implement _process overload with TPHandle argument and use
that to ping the hb map between pgs and between map epochs
when advancing a pg. The thread will still timeout if
genuinely stuck at any point.

Fixes: #3905
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e0511f4f4773766d04e845af2d079f82f3177cb6)

WorkQueue: add TPHandle to allow _process to ping the hb map

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 4f653d23999b24fc8c65a59f14905db6630be5b5)

ReplicatedPG: handle omap > max_recovery_chunk

span_of fails if len == 0.

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 8a97eef1f7004988449bd7ace4c69d5796495139)

ReplicatedPG: correctly handle omap key larger than max chunk

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c3dec3e30a85ecad0090c75a38f28cb83e36232e)

ReplicatedPG: start scanning omap at omap_recovered_to

Previously, we started scanning omap after omap_recovered_to.
This is a problem since the break in the loop implies that
omap_recovered_to is the first key not recovered.

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 09c71f2f5ee9929ac4574f4c35fb8c0211aad097)
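
To illustrate why the scan has to start at omap_recovered_to rather than
after it, here is a small Python sketch over a sorted key list. It is a model
only: next_chunk and its parameters are hypothetical, and the real recovery
code iterates an omap iterator in C++. The sketch assumes, as the commit
message states, that the recorded position is the first key not yet recovered.

```python
import bisect


def next_chunk(sorted_keys, omap_recovered_to, max_keys):
    """Return (keys to recover now, new omap_recovered_to or None when done)."""
    # Inclusive start: omap_recovered_to is the first key NOT yet recovered,
    # so an exclusive start (bisect_right) would silently skip it.
    start = bisect.bisect_left(sorted_keys, omap_recovered_to)
    chunk = sorted_keys[start:start + max_keys]
    end = start + len(chunk)
    new_recovered_to = sorted_keys[end] if end < len(sorted_keys) else None
    return chunk, new_recovered_to
```

Calling it repeatedly with the returned position walks every key exactly
once; swapping bisect_left for bisect_right would drop the first key of each
chunk after the first.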

ReplicatedPG: don't finish_recovery_op until the transaction completes

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 62a4b96831c1726043699db86a664dc6a0af8637)

ReplicatedPG: ack push only after transaction has completed

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 20278c4f77b890d5b2b95d2ccbeb4fbe106667ac)

ObjectStore: add queue_transactions with oncomplete

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 4d6ba06309b80fb21de7bb5d12d5482e71de5f16)

common/HeartbeatMap: inject unhealthy heartbeat for N seconds

This lets us test code that is triggered by an unhealthy heartbeat in a
generic way.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 132045ce085e8584a3e177af552ee7a5205b13d8)

os/FileStore: add stall injection into filestore op queue

Allow admin to artificially induce a stall in the op queue.  Forces the
thread(s) to sleep for N seconds.  We pause for 1 second increments and
recheck the value so that a previously stalled thread can be unwedged by
reinjecting a lower value (or 0).  To stall indefinitely, just inject a
very large number.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 657df852e9c89bfacdbce25ea014f7830d61e6aa)
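
The one-second recheck described above can be sketched as follows. This is a
Python model under stated assumptions, not the FileStore implementation:
run_injected_stall and read_stall_seconds are hypothetical names, and the
sleep function is injectable so the loop can be exercised without waiting.

```python
import time


def run_injected_stall(read_stall_seconds, sleep=time.sleep):
    """Sleep in 1-second steps while the injected value exceeds time slept.

    read_stall_seconds is called before every step, so lowering the value
    (or setting it to 0) releases a thread that is already stalled.
    Returns the number of seconds slept.
    """
    slept = 0
    while slept < read_stall_seconds():
        sleep(1)
        slept += 1
    return slept
```

Because the value is re-read on every iteration, a very large injected value
stalls for as long as it stays set, and reinjecting 0 ends the stall within
about one second.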

osd: do not join cluster if not healthy

If our internal heartbeats are failing, do not send a boot message and try
to join the cluster.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a4e78652cdd1698e8dd72dda51599348d013e5e0)

osd: hold lock while calling start_boot on startup

This probably doesn't strictly matter because start_boot doesn't need the
lock (currently) and few other threads should be running, but it is
better to be consistent.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c406476c0309792c43df512dddb2fe0f19835e71)

osd: do not reply to ping if internal heartbeat is not healthy

If we find that our internal threads are stalled, do not reply to ping
requests. If we do this long enough, peers will mark us down. If we are
only transiently unhealthy, we will reply to the next ping and they will
be satisfied. If we are unhealthy and marked down, and eventually recover,
we will mark ourselves back up.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad6b231127a6bfcbed600a7493ca3b66c68484d2)

osd: reduce op thread heartbeat default 30 -> 15 seconds

If the thread stalls for 15 seconds, let our internal heartbeat fail.
This will let us internally respond more quickly to a stalled or failing
disk.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 61eafffc3242357d9add48be9308222085536898)

osd: improve sub_op flag points

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 73a969366c8bbd105579611320c43e2334907fef)

osd: refactor ReplicatedPG::do_sub_op

PULL is the only case where we don't wait for active.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 23c02bce90c9725ccaf4295de3177e8146157723)

osd: make last state for slow requests more informative

Report on the last event string, and pass in important context for the
op event list, including:

- which peers were sent sub ops and we are waiting for
- which pg queue we are delayed by

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a1137eb3e168c2d00f93789e4d565c1584790df0)

osd: dump op priority queue state via admin socket

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 24d0d7eb0165c8b8f923f2d8896b156bfb5e0e60)

osd: simplify asok to single callback

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 33efe32151e04beaafd9435d7f86dc2eb046214d)

common/PrioritizedQueue: dump state to Formatter

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 514af15e95604bd241d2a98a97b938889c6876db)

common/PrioritizedQueue: add min cost, max tokens per bucket

Two problems.

First, we need to cap the tokens per bucket.  Otherwise, a stream of
items at one priority over time will indefinitely inflate the tokens
available at another priority.  The cap should represent how "bursty"
we allow a given bucket to be.  Start with 4MB for now.

Second, set a floor on the item cost.  Otherwise, we can have an
infinite queue of 0 cost items that starve other queues.  More
realistically, we need to balance the overhead of processing small items
with the cost of large items.  I.e., a 4 KB item is not 1/1000th as
expensive as a 4MB item.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6e3363b20e590cd9df89f2caebe71867b94cc291)
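
The two limits described above can be sketched with a toy token bucket. This
is an illustrative Python model, not the PrioritizedQueue code: the class and
method names are assumptions, the 4 MB cap comes from the commit message, and
the 64 KB minimum cost used in the usage note is a made-up value for the
example.

```python
class TokenBucket:
    """Toy per-priority bucket with a token cap and a minimum item cost."""

    def __init__(self, max_tokens, min_cost):
        self.max_tokens = max_tokens
        self.min_cost = min_cost
        self.tokens = 0

    def put_tokens(self, amount):
        # Cap the balance so traffic at other priorities cannot inflate
        # this bucket without bound; the cap bounds how bursty it can be.
        self.tokens = min(self.tokens + amount, self.max_tokens)

    def take_tokens(self, amount):
        self.tokens = max(self.tokens - amount, 0)

    def effective_cost(self, cost):
        # Floor the cost so zero-cost items still consume tokens.
        return max(cost, self.min_cost)
```

For example, TokenBucket(max_tokens=4 * 1024 * 1024, min_cost=64 * 1024)
never holds more than 4 MB of tokens however many are added, and charges a
zero-byte item 64 KB.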

common/PrioritizedQueue: buckets -> tokens

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c549a0cf6fae78c8418a3b4b0702fd8a1e4ce482)

note puller's max chunk in pull requests

this lets us calculate a cost value
(cherry picked from commit 128fcfcac7d3fb66ca2c799df521591a98b82e05)

osd: add OpRequest flag point when commit is sent

With writeahead journaling in particular, we can get requests that
stay in the queue for a long time even after the commit is sent to the
client while we are waiting for the transaction to apply to the fs.
Instead of showing up as 'waiting for subops', make it clear that the
client has gotten its reply and it is local state that is slow.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b685f727d4c37a26cb78bd4a04cce041428ceb52)

osd: set PULL subop cost to size of requested data

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a1bf8220e545f29b83d965f07b1abfbea06238b3)

osd: use Message::get_cost() function for queueing

The data payload is a decent proxy for cost in most cases, but not all.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e8e0da1a577e24cd4aad71fb94d8b244e2ac7300)

osd: debug msg prio, cost, latency

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bec96a234c160bebd9fd295df5b431dc70a2cfb3)

filestore: filestore_queue_max_ops 500 -> 50

Having a deep queue limits the effectiveness of the priority queues
above by adding additional latency.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 40654d6d53436c210b2f80911217b044f4d7643a)

osd: target transaction size 300 -> 30

Small transactions make pg removal nicer to the op queue. It also slows
down PG deletion a bit, which may exacerbate the PG resurrection case
until #3884 is addressed.

At least one user reported this fixed an osd that kept failing due to
an internal heartbeat failure.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1233e8617098766c95100aa9a6a07db1a688e290)

os/FileStore: allow filestore_queue_max_{ops,bytes} to be adjusted at runtime

The 'committing' ones too.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cfe4b8519363f92f84f724a812aa41257402865f)

osd: make osd_max_backfills dynamically adjustable

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 101955a6b8bfdf91f4229f4ecb5d5b3da096e160)

osd: make OSD a config observer

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9230c863b3dc2bdda12c23202682a84c48f070a1)

Conflicts:

src/osd/OSD.cc

librbd: Allow get_lock_info to fail

If the lock class isn't present, EOPNOTSUPP is returned for lock calls
on newer OSDs, but sadly EIO on older; we need to treat both as
acceptable failures for RBD images. rados lock list will still fail.

Fixes #3744.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4483285c9fb16f09986e2e48b855cd3db869e33c)
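
A hedged sketch of the error handling described above, in Python rather than
librbd's C++. The function name is hypothetical; it assumes the usual
convention of returning 0 on success and a negative errno value on failure,
and shows only the decision of which failures to tolerate.

```python
import errno

# Failures that mean "the lock class is not available" per the commit
# message: EOPNOTSUPP on newer OSDs, EIO on older ones.
ACCEPTABLE_LOCK_ERRORS = (-errno.EOPNOTSUPP, -errno.EIO)


def check_lock_info_result(r):
    """Return 0 if r is success or a tolerated failure, else return r."""
    if r < 0 and r not in ACCEPTABLE_LOCK_ERRORS:
        return r
    return 0
```

Any other negative result, such as -ENOENT, is still propagated to the
caller unchanged.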

osd: drop newlines from event descriptions

These produce extra newlines in the log.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 9a1f574283804faa6dbba9165a40558e1a6a1f13)

OSD: do deep_scrub for repair

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 0cb760f31b0cb26f022fe8b9341e41cd5351afac)

ReplicatedPG: ignore snap link info in scrub if nlinks==0

nlinks==0 implies that the replica did not send snap link information.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 70c3512037596a42ba6eb5eb7f96238843095db9)

osd/PG: fix osd id in error message on snap collection errors

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 381e25870f26fad144ecc2fb99710498e3a7a1d4)

osd/ReplicatedPG: validate ino when scrubbing snap collections

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 665577a88b98390b9db0f9991836d10ebdd8f4cf)

ReplicatedPG: compare nlinks to snapcolls

nlinks gives us the number of hardlinks to the object.
nlinks should be 1 + snapcolls.size(). This will allow
us to detect links which remain in an erroneous snap
collection.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e65ea70ea64025fbb0709ee8596bb2878be0bbdc)
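
The check described above amounts to a single comparison. The Python sketch
below is illustrative only (the function name is an assumption, and the real
check lives in the C++ scrub code); it also folds in the rule from the
"ignore snap link info in scrub if nlinks==0" entry in this log.

```python
def snap_links_consistent(nlinks, snapcolls):
    """True if the hard-link count matches the snap collections listed."""
    if nlinks == 0:
        # The replica sent no link information, so there is nothing to compare.
        return True
    # One link for the object itself plus one per snap collection.
    return nlinks == 1 + len(snapcolls)
```

An object listed in two snap collections should therefore have three hard
links; a count of four would point at a stray link left in some other snap
collection.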

ReplicatedPG/PG: check snap collections during _scan_list

During _scan_list check the snapcollections corresponding to the
object_info attr on the object. Report inconsistencies during
scrub_finalize.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 57352351bb86e0ae9f64f9ba0d460c532d882de6)

osd_types: add nlink and snapcolls fields to ScrubMap::object

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit b85687475fa2ec74e5429d92ee64eda2051a256c)

PG: move auth replica selection to helper in scrub

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 39bc65492af1bf1da481a8ea0a70fe7d0b4b17a3)

mon: note scrub errors in health summary

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8e33a8b9e1fef757bbd901d55893e9b84ce6f3fc)

osd: fix rescrub after repair

We were rescrubbing if INCONSISTENT is set, but that is now persistent.
Add a new scrub_after_recovery flag that is reset on each peering interval
and set that when repair encounters errors.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a586966a3cfb10b5ffec0e9140053a7e4ff105d2)

osd: note must_scrub* flags in PG operator<<

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d56af797f996ac92bf4e0886d416fd358a2aa08e)

osd: base INCONSISTENT pg state on persistent scrub errors

This makes the state persistent across PG peering and OSD restarts.

This has the side-effect that, on recovery, we rescrub any PGs marked
inconsistent. This is new behavior!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2baf1253eed630a7c4ae4cb43aab6475efd82425)

osd: fix scrub scheduling for 0.0

The initial value for pair<utime_t,pg_t> can match pg 0.0, preventing it
from being manually scrubbed. Fix!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 26a63df97b2a12fd1a7c1e3cc9ccd34ca2ef9834)

osd: note last_clean_scrub_stamp, last_scrub_errors

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 389bed5d338cf32ab14c9fc2abbc7bcc386b8a28)

osd: add num_scrub_errors to object_stat_t

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2475066c3247774a2ad048a2e32968e47da1b0f5)

osd: add last_clean_scrub_stamp to pg_stat_t, pg_history_t

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d738328488de831bf090f23e3fa6d25f6fa819df)

osd: fix object_stat_sum_t dump signedness

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6f6a41937f1bd05260a8d70b4c4a58ecadb34a2f)

osd: change scrub min/max thresholds

The previous 'osd scrub min interval' was mostly meaningless and useless.
Meanwhile, the 'osd scrub max interval' would only trigger a scrub if the
load was sufficiently low; if it was high, the PG might *never* scrub.

Instead, make the 'min' what the max used to be. If it has been more than
this many seconds, and the load is low, scrub. And add an additional
condition that if it has been more than the max threshold, scrub the PG
no matter what--regardless of the load.

Note that this does not change the default scrub interval for less-loaded
clusters, but it *does* change the meaning of existing config options.

Fixes: #3786
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 299548024acbf8123a4e488424c06e16365fba5a)

Conflicts:

PendingReleaseNotes
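
The new meaning of the two thresholds can be written as a small decision
function. This Python sketch follows the wording of the commit message above;
the function and parameter names are assumptions, and the real logic is in
the OSD's C++ scrub scheduling code.

```python
def should_scrub(seconds_since_scrub, load_is_low, min_interval, max_interval):
    """Decide whether a PG is due for a scrub under the new thresholds."""
    if seconds_since_scrub > max_interval:
        # Past the max threshold: scrub regardless of load.
        return True
    if seconds_since_scrub > min_interval and load_is_low:
        # Past the min threshold: scrub only if the load is low.
        return True
    return False
```

With illustrative intervals of one day (min) and one week (max), a PG last
scrubbed two days ago is scrubbed only when the load is low, while one last
scrubbed eight days ago is scrubbed even under high load.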

osd/PG: remove useless osd_scrub_min_interval check

This was already a no-op: we don't call PG::scrub_sched() unless it has
been osd_scrub_max_interval seconds since we last scrubbed. Unless we
explicitly requested it, in which case we don't want this check anyway.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 16d67c798b6f752a6e03084bafe861396b86baae)

osd: move scrub schedule random backoff to separate helper

Separate this from the load check, which will soon vary depending on the
PG.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a148120776d0930b265411332a60e93abfbf0423)

osd/PG: trigger scrub via scrub schedule, must_ flags

When a scrub is requested, flag it and move it to the front of the
scrub schedule instead of immediately queuing it. This avoids
bypassing the scrub reservation framework, which can lead to a heavier
impact on performance.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 62ee6e099a8e4873287b54f9bba303ea9523d040)

osd/PG: introduce flags to indicate explicitly requested scrubs

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1441095d6babfacd781929e8a54ed2f8a4444467)

osd/PG: move scrub schedule registration into a helper

Simplifies callers, and will let us easily modify the decision of when
to schedule the PG for scrub.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 796907e2159371f84a16cbd35f6caa8ac868acf6)

os/FileStore: only flush inline if write is sufficiently large

Honor filestore_flush_min in the inline flush case.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 49726dcf973c38c7313ab78743b45ccc879671ea)

os/FileStore: fix compile when sync_file_range is missing

If sync_file_range is not present, we always close inline, and flush
via fdatasync(2).

Fixes compile on ancient platforms like RHEL5.8.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 8ddb55d34c72e6df1023cf427cbd41f3f98da402)

osd: set pg removal transactions based on configurable

Use the osd_target_transaction_size knob, and gracefully tolerate bogus
values (e.g., <= 0).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5e00af406b89c9817e9a429f92a05ca9c29b19c3)
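
A minimal sketch of tolerating a bogus transaction-size setting, assuming the
simplest possible policy of clamping the value to at least 1. It is written
in Python for illustration; removal_batches is a hypothetical helper, not the
OSD's PG removal code.

```python
def removal_batches(objects, target_transaction_size):
    """Split objects into removal batches of the configured size."""
    # Clamp bogus values (<= 0) so the loop always makes progress.
    batch_size = max(target_transaction_size, 1)
    return [objects[i:i + batch_size]
            for i in range(0, len(objects), batch_size)]
```

A setting of 0 or -5 degrades to one object per transaction instead of
looping forever or raising an error.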

osd: make pg removal thread more friendly

For a large PG these are saturating the filestore and journal queues. Do
them synchronously to make them more friendly. They don't need to be fast.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4712e984d3f62cdf51ea67da8197eed18a5983dd)

os: move apply_transactions() sync wrapper into ObjectStore

This has nothing to do with the backend implementation.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bc994045ad67fb70c7a0457b8cd29273dd5d1654)

os: add apply_transaction() variant that takes a sequencer

Also, move the convenience wrappers into the interface and funnel through
a single implementation.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f6c69c3f1ac35546b90315fff625993ba5cd8c07)

osd: calculate initial PG mapping from PG's osdmap

The initial values of up/acting need to be based on the PG's osdmap, not
the OSD's latest. This can cause various confusion in
pg_interval_t::check_new_interval() when calling OSDMap methods due to the
up/acting OSDs not existing yet (for example).

Fixes: #3879
Reported-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Tested-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 17160843d0c523359d8fa934418ff2c1f7bffb25)

osdmap: make replica separate in default crush map configurable

Add 'osd crush chooseleaf type' option to control what the default
CRUSH rule separates replicas across. Default to 1 (host), and set it
to 0 in vstart.sh.

Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit c236a51a8040508ee893e4c64b206e40f9459a62)

ceph: adjust crush tunables via 'ceph osd crush tunables <profile>'

Make it easy to adjust crush tunables. Create profiles:

legacy: the legacy values
argonaut: the argonaut defaults, and what is supported.. legacy! (*)
bobtail: best that bobtail supports
optimal: the current optimal values
default: the current default values

* In actuality, argonaut supports some of the tunables, but it doesn't
say so via the feature bits.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 19ee23111585f15a39ee2907fa79e2db2bf523f0)

msg/Pipe: use state_closed atomic_t for _lookup_pipe

We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock. Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 82f8bcddb5fa09913eb477ee26c71d6b4bb8d97c)

msgr: inject delays at inconvenient times

Exercise some rare races by injecting delays before taking locks
via the 'ms inject internal delays' option.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a5d692a7b9b4bec2c27993ca37aa3fec4065292b)

msgr: fix race on Pipe removal from hash

When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry. The Pipe in this case will
have STATE_CLOSED. Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.

This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.

See bug #3675.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e99b4a307b4427945a4eb5ec50e65d6239af4337)

msgr: don't queue message on closed pipe

If we have a con that refs a pipe but it is closed, don't use it. If
the ref is still there, it is only because we are racing with fault()
and it is about to be (or just was) detached. Either way,

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6339c5d43974f4b495f15d199e01a141e74235f5)

msgr: atomically queue first message with connect_rank

Atomically queue the first message on the new pipe, without dropping
and retaking pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7bf0b0854d1f2706a3a2302bcbf92dd5c8c888ef)

config_opts.h: default osd_recovery_delay_start to 0

This setting was intended to prevent recovery from overwhelming peering traffic
by delaying the recovery_wq until osd_recovery_delay_start seconds after pgs
stop being added to it. This should be less necessary now that recovery
messages are sent with strictly lower priority than peering messages.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
(cherry picked from commit 44625d4460f61effe2d63d8280752f10f159e7b4)

osdmaptool: more fix cli test

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b0162fab3d927544885f2b9609b9ab3dc4aaff74)

osdmaptool: fix cli test

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bd8765c918174aea606069124e43c480c809943)

osdmaptool: allow user to specify pool for test-map-object

Fixes: #3820
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
(cherry picked from commit 85eb8e382a26dfc53df36ae1a473185608b282aa)

rados.cc: fix rmomapkey usage: val not needed

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <samuel.just@inktank.com>
(cherry picked from commit 625c3cb9b536a0cff7249b8181b7a4f09b1b4f4f)

librados.hpp: fix omap_get_vals and omap_get_keys comments

We list keys greater than start_after.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 3f0ad497b3c4a5e9bef61ecbae5558ae72d4ce8b)

rados.cc: use omap_get_vals_by_keys in getomapval

Fixes: #3811
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit cb5e2be418924cf8b2c6a6d265a7a0327f08d00a)

rados.cc: fix listomapvals usage: key,val are not needed

Fixes: #3812
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 44c45e520cc2e60c6c803bb245edb9330bff37e4)

rgw: copy object should not copy source acls

Fixes: #3802
Backport: argonaut, bobtail

When using the S3 api and x-amz-metadata-directive is
set to COPY we used to copy complete metadata of source
object. However, this shouldn't include the source ACLs.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 37dbf7d9df93dd0e92019be31eaa1a19dd9569c7)

OSD: only trim up to the oldest map still in use by a pg

map_cache.cached_lb() provides us with a lower bound across
all pgs for in-use osdmaps. We cannot trim past this since
those maps are still in use.

backport: bobtail
Fixes: #3770
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 66eb93b83648b4561b77ee6aab5b484e6dba4771)

Revert "osdmap: spread replicas across hosts with default crush map"

This reverts commit 503917f0049d297218b1247dc0793980c39195b3.

This breaks vstart and teuthology configs. A better fix is coming.

mon: OSDMonitor: don't output to stdout in plain text if json is specified

Fixes: #3748
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 410906e04936c935903526f26fb7db16c412a711)

osdmap: spread replicas across hosts with default crush map

This is more often the case than not, and we don't have a good way to
magically know what size of cluster the user will be creating. Better to
err on the side of doing the right thing for more people.

Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 7ea5d84fa3d0ed3db61eea7eb9fa8dbee53244b6)

ReplicatedPG: fix snapdir trimming

The previous logic was both complicated and not correct.  Consequently,
we have been tending to drop snapcollection links in some cases.  This
has resulted in clones incorrectly not being trimmed.  This patch
replaces the logic with something less efficient but hopefully a bit
clearer.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f42c37359d976d1fe90f2d3b877b9b0268adc0b)

v0.56.1

msg/Pipe: prepare Message data for wire under pipe_lock

We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.

Related to #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d16ad9263d7b1d3c096f56c56e9631fae8509651)

msgr: update Message envelope in encode, not write_message

Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message(). This removes most modifications from
Pipe::write_message().

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 40706afc66f485b2bd40b2b4b1cd5377244f8758)