git.apps.os.sepia.ceph.com Git

OSDMonitor: use deepish_copy_from for remove_down_pg_temp

This is a backport of 368852f6c0a884b8fdc80a5cd6f9ab72e814d412.

Make a deep copy of the OSDMap to avoid clobbering the in-memory copy with
the call to apply_incremental.

Fixes: #7060
Signed-off-by: Sage Weil <sage@inktank.com>

OSDMap: deepish_copy_from()

Make a deep(ish) copy of another OSDMap. Unfortunatley we can't make the
compiler-generated copy operator/constructors private until c++11. :(

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bd54b9841b9255406e56cdc7269bddb419453304)

buffer: make 0-length splice() a no-op

This was causing a problem in the Striper, but fixing it here will avoid
corner cases all over the tree. Note that we have to bail out before
the end-of-buffer check to avoid hitting that check when the bufferlist is
also empty.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit ff5abfbdae07ae8a56fa83ebaa92000896f793c2)

osdc/Striper: test zero-length add_partial_result

If we add a partial result that is 0-length, we used to hit an assert in
buffer::list::splice(). Add a unit test to verify the fix.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 28c7388d320a47657c2e12c46907f1bf40672b08)

packaging: apply udev hack rule to RHEL

In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.

Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)

Fixes http://tracker.ceph.com/issues/7245

Signed-off-by: Ken Dreyer <ken.dreyer@inktank.com>
(cherry picked from commit 64a0b4fa563795bc22753940aa3a4a2946113109)

mon/MDSMonitor: do not generate mdsmaps from already-laggy mds

There is one path where a mds that is not sending its beacon (e.g.,
because it is not running at all) will lead to proposal of new mdsmaps.
Fix it.

Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 584c2dd6bea3fe1a3c7f306874c054ce0cf0d2b5)

Fix #7187: Include all summary items in JSON health output

Signed-off-by: John Spray <john.spray@inktank.com>
(cherry picked from commit fdf3b5520d150f14d90bdfc569b70c07b0579b38)

qa/workunits/cephtool/test.sh: hashpspool takes int in emperor

Signed-off-by: Sage Weil <sage@inktank.com>

rgw: convert bucket info if needed

Fixes: #7110
In dumpling, the bucket info was separated into bucket entry point and
bucket instance objects. When setting bucket attrs we only ended up
updating the bucket instance object. However, pre-dumpling buckets still
keep everything at the entry-point object, so acl changes didn't affect
anything (because we never updated the entry point). This change just
converts the bucket info into the new format.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit a5f8cc7ec9ec8bef4fbc656066b4d3a08e5b215b)

osd: ignore OSDMap messages while we are initializing

The mon may occasionally send OSDMap messages to random OSDs, but is not
very descriminating in that we may not have authenticated yet. Ignore any
messages if that is the case; we will reqeust whatever we need during the
BOOTING state.

Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f68de9f352d53e431b1108774e4a23adb003fe3f)

mon: only send messages to current OSDs

When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.

Fixes: #7093
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 98ed9ac5fed6eddf68f163086df72faabd9edcde)

qa: test for error when ceph osd rm is EBUSY

http://tracker.ceph.com/issues/6824 fixes #6824

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 31507c90f0161c4569a2cc634c0b5f671179440a)

mon: set ceph osd (down|out|in|rm) error code on failure

Instead of always returning true, the error code is set if at least one
operation fails.

EINVAL if the OSD id is invalid (osd.foobar for instance).
EBUSY if trying to remove and OSD that is up.

When used with the ceph command line, it looks like this:

    ceph -c ceph.conf osd rm osd.0
    Error EBUSY: osd.0 is still up; must be down before removal.
    kill PID_OF_osd.0
    ceph -c ceph.conf osd down osd.0
    marked down osd.0.
    ceph -c ceph.conf osd rm osd.0 osd.1
    Error EBUSY: removed osd.0, osd.1 is still up; must be down before removal.

http://tracker.ceph.com/issues/6824 fixes #6824

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 15b8616b13a327701c5d48c6cb7aeab8fcc4cafc)

mon: OSDMonitor: fix some annoying whitespace

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 42c4137cbfacad5654f02c6608cc0e81b45c06be)

librbd: call user completion after incrementing perfcounters

The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object is has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.

The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.

Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 4cea7895da7331b84d8c6079851fdc0ff2f4afb1)

objecter: don't take extra throttle budget for resent ops

These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 8d0180b1b7b48662daef199931efc7f2a6a1c431)

rbd: check write return code during bench-write

This is allows rbd-bench to detect http://tracker.ceph.com/issues/6938
when combined with rapidly changing the mon osd full ratio.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 3caf3effcb113f843b54e06099099909eb335453)

objecter: resend all writes after osdmap loses the full flag

Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.

Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit e32874fc5aa6f59494766b7bbeb2b6ec3d8f190e)

osd: drop writes when full instead of returning an error

There's a race between the client and osd with a newly marked full
osdmap.  If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.

If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later.  -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO.  Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.

To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.

Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 4111729dda7437c23f59e7100b3c4a9ec4101dd0)

objecter: clean pause / unpause logic

op->paused holds now whether operation should be paused or not, and it's
being updated when scanning requests. No need to do a second scan.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 5fe3dc647bf936df8e1eb2892b53f44f68f19821)

objecter: set op->paused in recalc_op_target(), resend in not paused

When going through scan_requests() in handle_osd_map() we need to make
sure that if an op should not be paused anymore then set it on the op
itself, and return NEED_RESEND. Otherwise we're going to miss reset of
the full flag.
Also in handle_osd_map(), make sure that op shouldn't be paused before
sending it. There's a lot of cleanup around that area that we should
probably be doing, make the code much more tight.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 98ab7d64a191371fa39d840c5f8e91cbaaa1d7b7)

objecter: don't resend paused ops

Paused ops are meant to block on the client side until a new map that
unpauses them is recieved. If we send paused writes when the FULL flag
is set, we'll get -ENOSPC from the osds, which is not what Objecter
users expect. This may cause rbd without caching to produce an I/O
error instead of waiting for the cluster to have capacity.

Fixes: #6725
Backport: dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit c5c399d327cfc0d232d9ec7d49ababa914d0b21a)

v0.72.2

rgw: fix use-after-free when releasing completion handle

Backport: emperor, dumpling
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c8890ab2d46fe8e12200a0d2f9eab31c461fb871)

rgw: don't return data within the librados cb

Fixes: #7030
The callback is running within a single Finisher thread, thus we
shouldn't block there. Append read data to a list and flush it within
the iterate context.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit d6a4f6adfaa75c3140d07d6df7be03586cc16183)

Partial revert "mon: osd pool set syntax relaxed, modify unit tests"

This reverts commit 08327fed8213a5d24cd642e12b38a171b98924cb, except
for the hashpspool bit. We switched back to an integer argument in
commit 337195f04653eed8e8f153a5b074f3bd48408998.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e80ab94bf44e102fcd87d16dc11e38ca4c0eeadb)
Reviewed-by: Greg Farnum <greg@inktank.com>

mon: OSDMonitor: drop cmdval_get() for unused variable

We don't ever use any value as a float, so just drop obtaining it. This
makes it easier to partially revert 2fe0d0d9 in an upcoming patch.

Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 7191bb2b2485c7819ca7b9d9434d803d0c94db7a)
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

mon: OSDMonitor: receive CephInt on 'osd pool set' instead on CephString

This partially reverts 2fe0d0d9 in order to allow Emperor monitors to
forward mon command messages to Dumpling monitors without breaking a
cluster.

The need for this patch became obvious after issue #6796 was triggered.
Basically, in a mixed cluster of Emperor/Dumpling monitors, if a client
happens to obtain the command descriptions from an Emperor monitor and
then issue an 'osd pool set' this can turn out in one of two ways:

1. client msg gets forwarded to an Emperor leader and everything's a-okay;
2. client msg gets forwarded to a Dumpling leader and the string fails to
be interpreted without the monitor noticing, thus leaving the monitor with
an uninitialized variable leading to trouble.

If 2 is triggered, a multitude of bad things can happen, such as thousands
of pg splits, due to a simple 'osd set pool foo pg_num 128' turning out
to be interpreted as 109120394 or some other random number.

This patch is such that we make sure the client sends an integer instead
of a string. We also make sure to interpret anything the client sends as
possibly being a string, or an integer.

Fixes: 6796
Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 337195f04653eed8e8f153a5b074f3bd48408998)
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

init, upstart: prevent daemons being started by both

There can be only one init system starting a daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit,
but if an upstart file is present in the daemon's directory, upstart
will try to start them, regardless of host entries in ceph.conf.

If there's an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.

Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit or upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.

Backport: emperor, dumpling
Reported-by: Tim Spriggs <tims@uahirise.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5e34beb61b3f5a1ed4afd8ee2fe976de40f95ace)

rgw: don't error out on empty owner when setting acls

Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies empty owner field when trying to set acls on object
/ bucket. We errored out as it didn't match the current owner name, but
with this change we ignore it.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 14cf4caff58cc2c535101d48c53afd54d8632104)

rgw: lower some debug message

Fixes: #6084
Backport: dumpling, emperor

Reported-by: Ron Allred <rallred@itrefined.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit b35fc1bc2ec8c8376ec173eb1c3e538e02c1694e)

ReplicatedPG: test for missing head before find_object_context

find_object_context doesn't return EAGAIN for a missing head.
I chose not to change that behavior since it might hide bugs
in the future. All other callers check for missing on head
before calling into find_object_context because we potentially
need head or snapdir to map a snapid onto a clone.

Backport: emperor
Fixes: 6758
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit dd9d8b020286d5e3a69455023c3724a7b436d687)

osd: fix bench block size

The command was declared to take 'size' in dumpling, but was trying to
read 'bsize' instead, so it always used the default of 4MiB. Change
the bench command to read 'size', so it matches what existing clients
are sending.

Fixes: #6795
Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 40a76ef0d09f8ecbea13712410d9d34f25b91935)

v0.72.1

ceph-filestore-tool: add tool for fixing lost objects

Used to repair: #6761
Backport: emperor
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

osd_types: fix object_info_t backwards compatibility

Shipping an object_info_t to a replica with the dirty
flag set would cause the replica to interpret that
object as being lost. Instead, we always encode
lost into the slot where dumpling expects to find
it and add another field at the end of the encoding.

Backport: emperor
Fixes: #6761
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

v0.72

rgw: deny writes to a secondary zone by non-system users

Fixes: #6678
We don't want to allow regular users to write to secondary zones,
otherwise we'd end up with data inconsistencies.

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

doc/release-notes: note crush update timeout on startup change

Signed-off-by: Sage Weil <sage@inktank.com>

osdmaptool: fix cli tests

From c22c84a88c22688b6044ab37f65a3fe40dfe1983.

Signed-off-by: Sage Weil <sage@inktank.com>

Ceph: Fix memory leak in chain_flistxattr()

Free allocated memory before return.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ReplicatedPG: don't skip missing if sentries is empty on pgls

Formerly, if sentries is empty, we skip missing. In general,
we need to continue adding items from missing until we get
to next (returned from collection_list_partial) to avoid
missing any objects.

Fixes: #6633
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>

PG: fix operator<<,log_wierdness log bound warning

Split may cause holes such that head != tail and yet
log.empty().

Fixes: #6722
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>

PGLog::rewind_divergent_log: log may not contain newhead

Due to split, there may be a hole at newhead.

Fixes: #6722
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>

Merge pull request #824 from dmick/next

osdmaptool: don't put progress on stdout

Reviewed-by: Sage Weil <sage@inktank.com>

RadosModel: use sharedptr_registry for snaps_in_use

There might be two concurrent rollback ops each of which
adds snap x to snaps_in_use. Between when the first
completes and the second completes, snap x may be removed
since the first would have removed snap x from snaps_in_use.
Using sharedptr_registry here avoids this by ensuring that
the snap won't be removed from snaps_in_use until all refs
are gone.

This patch also adds size() to sharedptr_registry.

Fixes: #6719
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>

osdmaptool: don't put progress on stdout

If one requests JSON output, the progress message pollutes the output;
don't do that, send it to stderr instead

Signed-off-by: Dan Mick <dan.mick@inktank.com>

FileStore::_collection_move_rename: handle missing dst dir on replay

In case of a replay, a missing destination directory indicates that
the destination object and directory have been removed by a later
transaction. Thus, we need to remove the src object and return
0.

Fixes: #6714
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Merge pull request #814 from ceph/wip-da-fix-galois-warning

galois.c: fix compiler warning

Reviewed-by: Loic Dachary <loic@dachary.org>

galois.c: fix compiler warning

galois_create_split_w8_tables() takes no parameter, remove '8' passed
to the function in one case.

osd/ErasureCodePluginJerasure/galois.c: In function 'galois_w32_region_multiply':
osd/ErasureCodePluginJerasure/galois.c:696:5: warning: call to function 'galois_create_split_w8_tables' without a real prototype [-Wunprototyped-calls]
In file included from osd/ErasureCodePluginJerasure/galois.c:53:0:
osd/ErasureCodePluginJerasure/galois.h:71:12: note: 'galois_create_split_w8_tables' was declared here

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>

OSD: allow project_pg_history to handle a missing map

If we get a peering message for an old map we don't have, we
can throwit out: the sending OSD will learn about the newer
maps and update itself accordingly, and we don't have the
information to know if the message is valid. This situation
can only happen if the sender was down for a long enough time
to create a map gap and its PGs have not yet advanced from
their boot-up maps to the current ones, so we can rely on it

Fixes: #6712
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

OSD: don't clear peering_wait_for_split in advance_map()

I really don't know why I added this...  Ops can be discarded from the
waiting_for_pg queue if we aren't primary simply because there must have
been an exchange of peering events before subops will be sent within a
particular epoch.  Thus, any events in the waiting_for_pg queue must be
client ops which should only be seen by the primary.  Peering events, on
the other hand, should only be discarded if we are in a new interval,
and that check might as well be performed in the peering wq.

Fixes: #6681
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

ReplicatedPG::recover_backfill: adjust last_backfill to HEAD if snapdir

Otherwise, if last_backfill_started is a snapdir, we will fail to send a
transaction for a client IO creating the head object and removing the
snapdir object. The result will be that head will eventually be
backfilled, but the snapdir object will erroneously not be removed.

Fixes: #6685
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Merge pull request #809 from ceph/wip-pgmap

Reviewed-by: Greg Farnum <greg@inktank.com>

osd/erasurecode: correct one variable name in jerasure_matrix_to_bitmatrix()

When bitmatrix is NULL, this function returns NULL.

Signed-off-by: Xing Lin <xinglin@cs.utah.edu>
Reviewed-by: Sage Weil <sage@inktank.com>

mon/PGMap: use const ref, not pass-by-value

Signed-off-by: Sage Weil <sage@inktank.com>

Merge pull request #806 from jdurgin/wip-xfstests

Don't run racy xfstest 008

Merge pull request #807 from jdurgin/wip-rbd-map-rw

rbd: omit 'rw' option during map

Reviewed-by: Sage Weil <sage@inktank.com>

Merge pull request #804 from jdurgin/wip-rgw-replica-log-next

rgw: don't turn 404 into 400 for the replicalog api

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

rbd: omit 'rw' option during map

The ro and rw options were added in linux 3.7. To be compatible with
older kernels, don't specify rw. The default will probably always be
rw, so this should not present any problems in the future.

Reported-by: nicolasc <nicolas.canceill@surfsara.nl>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

qa: don't run racy xfstest 008

This test attempts to generate a random number of holes within a
particular range, but may fail because hole placement is random.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Merge pull request #802 from ceph/wip-6673b
(manually merged after next branch rebuilt)

OSDMonitor: be a little nicer about letting users do pg splitting

Reviewed-by: David Zafman <david.zafman@inktank.com>

OSDMonitor: be a little nicer about letting users do pg splitting

We were previously blocking pg splits whenever pg creations were in-
progress, but we only really need to avoid splitting any pgs which are
currently being created. Let the user set a different pg_num if there
are no creating PGs on the pool in question.

Fixes: #6673, take two
Signed-off-by: Greg Farnum <greg@inktank.com>

rgw: don't turn 404 into 400 for the replicalog api

404 is not actually a problem to clients like radosgw-agent, but 400
implies something about the request was incorrect.

Backport: dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Wrap hex_to_num table into class HexTable

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

[rgw] Set initialized to true after populating table in hex_to_num()

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

sharedptr_registry.hpp: removed ptrs need to not blast contents

See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)

The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.

To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.

This should work since the pointer passed to the removal
callback must be unique up to that point since it has
not yet been deleted.

This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.

Fixes: #5951
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

prio-q: initialize cur iterator

For new SubQueues `cur` is not intialized, so front/pop_front will freak
out. I honestly I have no idea how this hasn't been seen, but it was
being triggered frequently on OSX.

Fixes: #6686
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>

PGLog: remove obsolete assert in merge_log

This assert assumes that if olog.head != log.head, olog contains
a log entry at log.head, which may not be true since pg splitting
might have left the log with arbitrary holes.

Related: 0c2769d3321bff6e85ec57c85a08ee0b8e751bcb
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

test/osd/RadosModel.h: select and reserve roll_back_to atomically

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

test/rados/list.cc: we might get some objects more than once

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

os/chain_listxattr: fix leak fix

e22347df3854a5c5ebc6631c62d70447d67d722d added a bad goto; just free
explicitly instead.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Noah Watkins <noahwatkins@gmail.com>

Merge branch 'next' of jenkins:ceph/ceph into next

ceph: Release resource before return in BackedObject::download()

Close file before return

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: Fix memory leak in chain_listxattr

Free allocated memory before return

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>

Fix memory leak in Backtrace::print()

Free already allocated memory if short of memory

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>

v0.72-rc1

Revert "ceph-crush-location: new crush location hook"

This reverts commit fc49065d855cfd74cb861d294f3464dd616e82ee.

Merged to wrong branch; my bad!

Revert "upstart, sysvinit: use ceph-crush-location hook"

This reverts commit 111a37efb19cb46a48d669bc9866c29b4015a889.

Merge pull request #779 from ceph/wip-crush-hook

upstart,sysvinit: allow 'osd crush location hook' script to determine osd crush position

Reviewed-by: Loic Dachary <loic@dachary.org>

upstart, sysvinit: use ceph-crush-location hook

Instead of hard-coding a check in ceph.conf and some reasonable
defaults, defer this work to ceph-crush-location, and allow users to
specify their own hook with alternative logic.

This can be helpful in a nubmer of cases, like:

- rack (or other) information included in hostname and easily parsed
out by a hook
- multiple types of devices in each host, resulting in 'parallel'
crush trees (e.g., one for hdd, one for ssd)

Signed-off-by: Sage Weil <sage@inktank.com>

ceph-crush-location: new crush location hook

This generalizes the bit of code that builds a key=value pair list to
update an entity's CRUSH location.

Signed-off-by: Sage Weil <sage@inktank.com>

Merge pull request #786 from ceph/wip-6673

mon/PGMonitor: always send pg creations after mapping

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>

mon/PGMonitor: always send pg creations after mapping

At some point in the dumpling cycle I separated the map stage from the
send stage. We can send the creates any time we have a non-zero osdmap
epoch, and are in good shape as long as we do the map step after the
osdmap is loaded (hence the post_paxos_update).

Some background:

We originally introduced the map-but-don't send in a2fe0137, at which
point all was well because we only called it on ceph-mon startup.

Later, this turned into post_paxos_update in e635c478, at which point
it was now called by a running monitor.. but we didn't add in the
send_pg_creates(). This is where this bug stems from.

This particular path is responsible for the stalled test referenced in
bug #6673.

Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>

mon/OSDMonitor: fix signedness warning on poolid

Signed-off-by: Sage Weil <sage@inktank.com>

ReplicatedPG::recover_backfill: update last_backfill to max() when backfill is complete

Signed-off-by: Samuel Just <sam.just@inktank.com>

Merge pull request #780 from ceph/wip-6585

Wip 6585

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>

osd/ReplicatedPG: use MIN for backfill_pos

Signed-off-by: Sage Weil <sage@inktank.com>

Merge pull request #772 from ceph/wip-5612

init-ceph, upstart: make crush update on osd start time out

Reviewed-by: Loic Dachary <loic@dachary.org>

ReplicatedPG: recover_backfill: don't prematurely adjust last_backfill

We can't adjust last_backfill to object x until x has been fully
backfilled. pending_backfill_updates contains all those backfills
started, but which have not yet been reflected in pinfo.last_update.
backfills_in_flight contains those backfills which have not yet
completed. Thus, we can adjust last_update to the largest entry
in pending_backfill_updates not in backfills_in_flight.

Signed-off-by: Samuel Just <sam.just@inktank.com>

ReplicatedPG: add empty stat when we remove an object in recover_backfill

Subsequent updates to that object need to have their stats added
to the backfill info stats atomically with the last_backfill
update.

Signed-off-by: Samuel Just <sam.just@inktank.com>

ReplicatedPG: replace backfill_pos with last_backfill_started

last_backfill_started reflects what pinfo.last_backfill will be
once all currently outstanding backfills complete.  backfill_pos
was tricky since we couldn't correctly inialize it without
doing the first backfill scan pair.

In recover_backfill, we rescan from last_backfill_started rather
than from backfill_pos.  This ensures that we capture all clones
created between last_backfill_started and what previously had been
backfill_pos without special handling in make_writeable.  The main
downside is that we will tend to "rescan" last_backfill_started.

Signed-off-by: Samuel Just <sam.just@inktank.com>

PG::BackfillInfo: introduce trim_to

We'll use this to trim off last_backfill_started since it'll
often be included in rescans.

Signed-off-by: Samuel Just <sam.just@inktank.com>

PG::BackfillInterval: use trim() in pop_front()

Signed-off-by: Samuel Just <sam.just@inktank.com>

ReplicatedPG::prepare_transaction: info.last_backfill is inclusive

Signed-off-by: Samuel Just <sam.just@inktank.com>

upstart: fail osd start if crush update fails

If the update for the CRUSH position fails for some reason, do not
start the OSD.

Signed-off-by: Sage Weil <sage@inktank.com>

init-ceph: make crush update on osd start time out

If the monitor is not currently available, this crush update would block
forever, preventing the OSD and (potentially) the rest of the system
from starting up. Instead, make it time out after 10 seconds and then
abort startup. This prevents startup of an OSD if we failed to update
the CRUSH position for some reason.

In fact, do not start up the OSD if the CRUSH update fails for any
reason--not just a timeout!

Works-around: #5612
Signed-off-by: Sage Weil <sage@inktank.com>

Merge pull request #778 from ceph/wip-6621

radosgw-admin: accept negative values for quota params

Reviewed-by: Sage Weil <sage@inktank.com>

radosgw-admin: accept negative values for quota params

and document that in the usage output.

Fixes: #6621
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Merge pull request #760 from ceph/wip-6585

Wip 6585

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>