ceph-disk: do not stop activate-all on first failure

Keep going even if we hit one activation error. This avoids failing to
start some disks when only one of them won't start (e.g., because it
doesn't belong to the current cluster).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c9074375bfbe1e3757b9c423a5ff60e8013afbce)
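
The keep-going behavior described above can be sketched roughly as follows. This is an illustration only, not the ceph-disk code: the `activate_all` signature and the injected `activate` callable are hypothetical names chosen for the example.

```python
def activate_all(devices, activate):
    """Try every device; remember failures instead of stopping at the first.

    `activate` is a hypothetical callable that raises on failure.
    """
    failed = []
    for dev in devices:
        try:
            activate(dev)
        except Exception as e:
            # Report the error and keep going, so that one bad disk
            # (e.g., one from another cluster) does not block the rest.
            print('activate failed for %s: %s' % (dev, e))
            failed.append(dev)
    return failed
```

A caller could still exit non-zero when the returned list is non-empty, so the failure is reported without preventing the other disks from starting.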

ceph.spec: include partuuid rules in package

Commit f3234c147e083f2904178994bc85de3d082e2836 missed this.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 253069e04707c5bf46869f4ff5a47ea6bb0fde3e)

ceph.spec: install/uninstall init script

This was commented out almost years ago in commit 9baf5ef4 but it is not
clear to me that it was correct to do so. In any case, we are not
installing the rc.d links for ceph, which means it does not start up after
a reboot.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cc9b83a80262d014cc37f0c974963cf7402a577a)

sysvinit, upstart: ceph-disk activate-all on start

On 'service ceph start' or 'service ceph start osd' or start ceph-osd-all
we should activate any osd GPT partitions.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 13680976ef6899cb33109f6f841e99d4d37bb168)

ceph-disk: add 'activate-all'

Scan /dev/disk/by-parttypeuuid for ceph OSDs and activate them all. This
is useful when the event didn't trigger on the initial udev event for
some reason.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5c7a23687a1a21bec5cca7b302ac4ba47c78e041)
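
A minimal sketch of such a scan, assuming entries are named as a type UUID, a one-character separator, and a partition UUID. The `OSD_TYPE_UUID` value and the function name are assumptions made for this example and should be checked against the actual udev rules.

```python
import os

# Assumption: GPT partition type GUID used for Ceph OSD data partitions.
OSD_TYPE_UUID = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'
UUID_LEN = 36  # length of a textual UUID

def find_osd_partitions(directory='/dev/disk/by-parttypeuuid'):
    """Return paths of entries whose name starts with the OSD type UUID."""
    found = []
    for name in sorted(os.listdir(directory)):
        # Names look like "<type uuid><separator><partition uuid>"; slicing
        # by length avoids depending on which separator character is used.
        if len(name) == 2 * UUID_LEN + 1 and name[:UUID_LEN].lower() == OSD_TYPE_UUID:
            found.append(os.path.join(directory, name))
    return found
```

Each returned path could then be handed to the normal activation step.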

udev: /dev/disk/by-parttypeuuid/$type-$uuid

We need this to help trigger OSD activations.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d512dc9eddef3299167d4bf44e2018b3b6031a22)

rgw: escape prefix correctly when listing objects

Fixes: #5362
When listing objects prefix needs to be escaped correctly (the
same as with the marker). Otherwise listing objects with prefix
that starts with underscore doesn't work.
Backport: bobtail, cuttlefish

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit d582ee2438a3bd307324c5f44491f26fd6a56704)

messages/MMonSync: initialize crc in ctor

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cd1c289b96a874ff99a83a44955d05efc9f2765a)

client: fix ancient typo in caps revocation path

If we have dropped all references to a revoked capability, send the ack
to the MDS. This typo has been there since v0.7 (early 2009)!

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit b7143c2f84daafbe2c27d5b2a2d5dc40c3a68d15)

messages/MMonHealth: remove unused flag field

This was initialized in (one of) the ctor(s), but not encoded/decoded,
and not used. Remove it. This makes valgrind happy.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 08bb8d510b5abd64f5b9f8db150bfc8bccaf9ce8)

messages/MMonProbe: fix uninitialized variables

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4974b29e251d433101b69955091e22393172bcd8)

common/Preforker: fix broken recursion on exit(3)

If we exit via preforker, call exit(3) and not recursively back into
Preforker::exit(r).  Otherwise you get a hang with the child blocked
at:

Thread 1 (Thread 0x7fa08962e7c0 (LWP 5419)):
#0  0x000000309860e0cd in write () from /lib64/libpthread.so.0
#1  0x00000000005cc906 in Preforker::exit(int) ()
#2  0x00000000005c8dfb in main ()

and the parent at

#0  0x000000309860eba7 in waitpid () from /lib64/libpthread.so.0
#1  0x00000000005cc87a in Preforker::parent_wait() ()
#2  0x00000000005c75ae in main ()

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7e7ff7532d343c473178799e37f4b83cf29c4eee)

rules: Don't disable tcmalloc on ARM (and other non-intel)

Fixes #5342

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>

Remove mon socket in post-stop

If ceph-mon segfaults, the socket file isn't removed.

By adding a remove in post-stop, upstart cleans the run directory properly.

Signed-off-by: Guilhem Lettron <guilhem@lettron.fr>
(cherry picked from commit 554b41b171eab997038e83928c462027246c24f4)

Remove stop on from upstart tasks

Upstart tasks don't have the concept of 'stop on' as they
are not long running.
(cherry picked from commit 17f6fccabc262b9a6d59455c524b550e77cd0fe3)

ceph-disk: extra dash in error message

Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit f86b4e7a4831c684033363ddd335d2f3fb9a189a)

ceph-disk: cast output of _check_output()

Cast output of _check_output() to str() to be able to use
str.split().

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 16ecae153d260407085aaafbad1c1c51f4486c9a)

ceph-disk: remove unnecessary semicolons

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 9785478a2aae7bf5234fbfe443603ba22b5a50d2)

ceph-disk: fix undefined variable

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 9429ff90a06368fc98d146e065a7b9d1b68e9822)

ceph-disk: add missing spaces around operator

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit c127745cc021c8b244d721fa940319158ef9e9d4)

udev: drop useless --mount argument to ceph-disk

It doesn't mean anything anymore; drop it.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bcfd2f31a50d27038bc02e645795f0ec99dd3b32)

ceph-disk-udev: activate-journal

Trigger 'ceph-disk activate-journal' from the alt udev rules.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b139152039bfc0d190f855910d44347c9e79b22a)

ceph-disk: do not use mount --move (or --bind)

The kernel does not let you mount --move when the parent mount is
shared (see, e.g., https://bugzilla.redhat.com/show_bug.cgi?id=917008
for another person this also confused). We can't use --bind either
since that (on RHEL at least) screws up /etc/mtab so that the final
result looks like

/var/lib/ceph/tmp/mnt.HNHoXU /var/lib/ceph/osd/ceph-0 none rw,bind 0 0

Instead, mount the original dev in the final location and then umount
from the old location.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e5ffe0d2484eb6cbcefcaeb5d52020b1130871a5)
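
The mount-then-umount order described above might look like the sketch below. The `move_mount` name and the injectable `run` parameter are hypothetical, and filesystem type and mount options are left out for brevity.

```python
import subprocess

def move_mount(dev, tmp_path, final_path, run=subprocess.check_call):
    """Mount dev at final_path, then unmount the temporary location.

    This avoids both 'mount --move' and 'mount --bind'.
    """
    # Mount the original device a second time, at its final location.
    run(['mount', dev, final_path])
    # Only then drop the temporary mount point.
    run(['umount', tmp_path])
```

Mounting first means the filesystem stays mounted somewhere for the whole operation, and /etc/mtab ends up with an ordinary device entry rather than a bind entry.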

ceph.spec: include by-partuuid udev workaround rules

These are needed for old or buggy udev. Having them for new and unbroken
udev is harmless.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f3234c147e083f2904178994bc85de3d082e2836)

ceph-disk: work around buggy rhel/centos parted

parted on RHEL/Centos prefixes the *machine readable output* with

1b 5b 3f 31 30 33 34 68

Note that the same thing happens when you 'import readline' in python.

Work around it!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 82ff72f827b9bd7f91d30a09d35e42b25d2a7344)
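
One way to work around the prefix is to strip it before parsing. The sketch below is an illustration, not the ceph-disk code: the constant and function names are made up for the example, and the sample output line in the usage note is hypothetical.

```python
# The eight bytes quoted above, 1b 5b 3f 31 30 33 34 68: ESC [ ? 1 0 3 4 h
PARTED_JUNK = '\x1b[?1034h'

def strip_parted_junk(out):
    """Drop the escape sequence if parted put it in front of its output."""
    if out.startswith(PARTED_JUNK):
        return out[len(PARTED_JUNK):]
    return out
```

For example, `strip_parted_junk('\x1b[?1034hBYT;\n')` returns `'BYT;\n'`, and output without the prefix passes through unchanged.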

ceph-disk: implement 'activate-journal'

Activate an osd via its journal device. udev populates its symlinks and
triggers events in an order that is not related to whether the device is
an osd data partition or a journal. That means that triggering
'ceph-disk activate' can happen before the journal (or journal symlink)
is present and then fail.

Similarly, it may be that they are on different disks that are hotplugged
with the journal second.

This can be wired up to the journal partition type to ensure that osds are
started when the journal appears second.

Include the udev rules to trigger this.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a2a78e8d16db0a71b13fc15457abc5fe0091c84c)

ceph-disk: call partprobe outside of the prepare lock; drop udevadm settle

After we change the final partition type, sgdisk may or may not trigger a
udev event, depending on how well udev is behaving (it varies between
distros, it seems).  The old code would often settle and wait for udev to
activate the device, and then partprobe would uselessly fail because it
was already mounted.

Call partprobe only at the very end, after prepare is done.  This ensures
that if partprobe calls udevadm settle (which it sometimes does) we do not
get stuck.

Drop the udevadm settle.  I'm not sure what this accomplishes; take it out,
at least until we determine we need it.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8b3b59e01432090f7ae774e971862316203ade68)

ceph-disk: add 'zap' command

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 10ba60cd088c15d4b4ea0b86ad681aa57f1051b6)

ceph-disk: fix stat errors with new suppress code

Broken by 225fefe5e7c997b365f481b6c4f66312ea28ed61.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bcc8bfdb672654c6a6b48a2aa08267a894debc32)

ceph-disk: add '[un]suppress-activate <dev>' command

It is often useful to prepare but not activate a device, for example when
preparing a bunch of spare disks. This marks a device as 'do not
activate' so that it can be prepared without activating.

Fixes: #3255
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 225fefe5e7c997b365f481b6c4f66312ea28ed61)

upstart: start ceph-all on runlevel [2345]

Starting when only one network interface has started breaks machines with
multiple nics in very problematic ways.

There may be an earlier trigger that we can use for cases where other
services on the local machine depend on ceph, but for now this is better
than the existing behavior.

See #5248

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7e08ed1bf154f5556b3c4e49f937c1575bf992b8)

client: set issue_seq (not seq) in cap release

We regularly have been observing a stall where the MDS is blocked waiting
for a cap revocation (Ls, in our case) and never gets a reply.  We finally
tracked down the sequence:

- mds issues cap seq 1 to client
- mds does revocation (seq 2)
- client replies
- much time goes by
- client trims inode from cache, sends release with seq == 2
- mds ignores release because its issue_seq is 1
- mds later tries to revoke other caps
- client discards message because it doesn't have the inode in cache

The problem is simply that we are using seq instead of issue_seq in the
cap release message.  Note that the other release call site in
encode_inode_release() is correct.  That one is much more commonly
triggered by short tests, as compared to this case where the inode needs to
get pushed out of the client cache.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9b012e234a924efd718826ab6a53b9aeb7cd6649)
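
The sequence above can be replayed with a toy model. This is a deliberately simplified sketch: `ToyMDS` and its methods are hypothetical, and the real MDS bookkeeping is more involved than two counters.

```python
class ToyMDS(object):
    """Toy model: tracks one cap held by one client."""

    def __init__(self):
        self.issue_seq = 0   # bumped only when caps are issued
        self.seq = 0         # bumped on every cap message
        self.has_cap = False

    def issue(self):
        self.issue_seq += 1
        self.seq += 1
        self.has_cap = True

    def revoke(self):
        # A revocation bumps seq but leaves issue_seq alone.
        self.seq += 1

    def handle_release(self, release_seq):
        # The release is honored only if it names the current issue_seq.
        if release_seq != self.issue_seq:
            return False     # ignored; the MDS still thinks the cap is held
        self.has_cap = False
        return True


mds = ToyMDS()
mds.issue()    # seq 1, issue_seq 1
mds.revoke()   # seq 2, issue_seq 1
print(mds.handle_release(mds.seq))        # release carrying seq is ignored
print(mds.handle_release(mds.issue_seq))  # release carrying issue_seq is accepted
```

In this model, the first release leaves the MDS believing the client still holds the cap, which is the state that leads to the stall.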

osd: skip mark-me-down message if osd is not up

Fixes crash when the OSD has not successfully booted and gets a
SIGINT or SIGTERM.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c2e262fc9493b4bb22c2b7b4990aa1ee7846940e)

ceph-fuse: create finisher threads after fork()

The ObjectCacher and MonClient classes both instantiate Finisher
threads. We need to make sure they are created *after* the fork(2)
or else the process will fail to join() them on shutdown, and the
threads will not exist while fuse is doing useful work.

Put CephFuse on the heap and move all this initialization into the child
block, and make sure errors are passed back to the parent.

Fix-proposed-by: Alexandre Marangone <alexandre.maragone@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4fa5f99a40792341d247e51488c37301da3c4e4f)

osd: do not include logbl in scrub map

This is a potentially huge object/file, usually prefixed by a zeroed region
on disk, that is not used by scrub at all.  It dates back to
f51348dc8bdd5071b7baaf3f0e4d2e0496618f08 (2008) and the original version of
scrub.

This *might* fix #4179.  It is not a leak per se, but I observed 1GB
scrub messages going over the wire.  Maybe the allocations are causing
fragmentation, or the sub_op queues are growing.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 0b036ecddbfd82e651666326d6f16b3c000ade18)

rgw: handle deep uri resources

In case of deep uri resources (ones created beyond a single level
of hierarchy, e.g. auth/v1.0) we want to create a new empty
handler for the path if no handler exists. E.g., for
auth/v1.0 we need to have a handler for 'auth', otherwise
the default S3 handler will be used, which we don't want.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit ad3934e335399f7844e45fcfd17f7802800d2cb3)

rgw: fix get_resource_mgr() to correctly identify resource

Fixes: #5262
The original test was not comparing the correct string, ended up
with the effect of just checking the substring of the uri to match
the resource.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 8d55b87f95d59dbfcfd0799c4601ca37ebb025f5)

rgw: add 'cors' to the list of sub-resources

Fixes: #5261
Backport: cuttlefish
Add 'cors' to the list of sub-resources, otherwise auth signing
is wrong.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9a0a9c205b8c24ca9c1e05b0cf9875768e867a9e)

mon: fix preforker exit behavior

In 3c5706163b72245768958155d767abf561e6d96d we made exit() not actually
exit so that the leak checking would behave for a non-forking case.
That is only needed for the normal exit case; every other case expects
exit() to actually terminate and not continue execution.

Instead, make a signal_exit() method that signals the parent (if any)
and then lets you return. exit() goes back to its usual behavior,
fixing the many other calls in main().

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 92d085f7fd6224ffe5b7651c1f83b093f964b5cd)

rados.py: correct some C types

trunc was getting size_t instead of uint64_t, leading to bad results
in 32-bit environments. Explicitly cast to the desired type
everywhere, so it's clear the correct type is being used.

Fixes: #5233
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 6dd7d469000144b499af84bda9b735710bb5cec3)
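
The truncation can be reproduced with ctypes on any machine. In this sketch `c_uint32` stands in for a 32-bit `size_t`, and the `size` value is arbitrary.

```python
import ctypes

size = 2**32 + 5  # a length that does not fit in 32 bits

# On a 32-bit build size_t is 32 bits wide; c_uint32 stands in for it here.
# ctypes integer types do no overflow checking, so the high bits are dropped.
as_32bit_size_t = ctypes.c_uint32(size).value

# Casting explicitly to uint64_t keeps the full value on every platform.
as_uint64 = ctypes.c_uint64(size).value

print(as_32bit_size_t)
print(as_uint64)
```

Wrapping each argument in the exact C type the function expects makes the intended width visible at the call site.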

v0.61.3

os/LevelDBStore: only remove logger if non-null

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ce67c58db7d3e259ef5a8222ef2ebb1febbf7362)
Fixes: #5255

test_librbd: use correct type for varargs snap test

uint64_t is passed in, but int was extracted. This fails on 32-bit builds.

Fixes: #5220
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 17029b270dee386e12e5f42c2494a5feffd49b08)

os/LevelDBStore: fix merge loop

We were double-incrementing p, both in the for statement and in the
body. While we are here, drop the unnecessary else's.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit eb6d5fcf994d2a25304827d7384eee58f40939af)

msgr: add get_messenger() to Connection

This was part of commit 27381c0c6259ac89f5f9c592b4bfb585937a1cfc.

Signed-off-by: Sage Weil <sage@inktank.com>

mon: start lease timer from peon_init()

In the scenario:

- leader wins, peons lose
- leader sees it is too far behind on paxos and bootstraps
- leader tries to sync with someone, waits for a quorum of the others
- peons sit around forever waiting

The problem is that they never time out because paxos never issues a lease,
which is the normal timeout that lets them detect a leader failure.

Avoid this by starting the lease timeout as soon as we lose the election.
The timeout callback just does a bootstrap and does not rely on any other
state.

I see one possible danger here: there may be some "normal" cases where the
leader takes a long time to issue its first lease that we currently
tolerate, but won't with this new check in place. I hope that raising
the lease interval/timeout or reducing the allowed paxos drift will make
that a non-issue. If it is problematic, we will need a separate explicit
"i am alive" from the leader while it is getting ready to issue the lease
to prevent a live-lock.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit f1ccb2d808453ad7ef619c2faa41a8f6e0077bd9)

mon: discard messages from disconnected clients

If the client is not connected, discard the message. They will
reconnect and resend anyway, so there is no point in processing it
twice (now and later).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit fb3cd0c2a8f27a1c8d601a478fd896cc0b609011)

msgr: add Messenger reference to Connection

This allows us to get the messenger associated with a connection.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 92a558bf0e5fee6d5250e1085427bff22fe4bbe4)

mon/Paxos: adjust trimming defaults up; rename options

- trim more at a time (by an order of magnitude)
- rename fields to paxos_trim_{min,max}; only trim when there are min items
that are trimmable, and trim at most max items at a time.
- adjust the paxos_service_trim_{min,max} values up by a factor of 2.

Since we are compacting every time we trim, adjusting these up means less
frequent compactions and less overall work for the monitor.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 6b8e74f0646a7e0d31db24eb29f3663fafed4ecc)
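
The min/max rule in the second bullet can be sketched as below. The function name and the numbers in the usage note are hypothetical, and whether the real check uses a strict or non-strict comparison is an assumption.

```python
def items_to_trim(trimmable, trim_min, trim_max):
    """How many items to trim now under a min/max policy."""
    # Do nothing until at least trim_min items can be trimmed.
    if trimmable < trim_min:
        return 0
    # Otherwise trim everything available, capped at trim_max per round.
    return min(trimmable, trim_max)
```

With a minimum of 500 and a maximum of 1000, for instance, 300 trimmable items yield no trim, 700 yield a trim of 700, and 5000 yield a trim of 1000.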

common/Preforker: fix warnings

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a284c9ece85f11d020d492120be66a9f4c997416)

fix test users of LevelDBStore

Need to pass in cct.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 446e0770c77de5d72858dcf7a95c5b19f642cf98)

mon: destroy MonitorDBStore before g_ceph_context

Put it on the heap so that we can destroy it before the g_ceph_context
cct that it references. This fixes a crash like

*** Caught signal (Segmentation fault) **
in thread 4034a80
ceph version 0.63-204-gcf9aa7a (cf9aa7a0037e56eada8b3c1bb59d59d0bfe7bba5)
1: ceph-mon() [0x59932a]
2: (()+0xfcb0) [0x4e41cb0]
3: (Mutex::Lock(bool)+0x1b) [0x6235bb]
4: (PerfCountersCollection::remove(PerfCounters*)+0x27) [0x6a0877]
5: (LevelDBStore::~LevelDBStore()+0x1b) [0x582b2b]
6: (LevelDBStore::~LevelDBStore()+0x9) [0x582da9]
7: (main()+0x1386) [0x48db16]
8: (__libc_start_main()+0xed) [0x658076d]
9: ceph-mon() [0x4909ad]

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit df2d06db6f3f7e858bdadcc8cd2b0ade432df413)

mon: fix leak of health_monitor and config_key_service

Switch to using regular pointers here. The lifecycle of these services is
very simple such that refcounting is overkill.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c888d1d3f1b77e62d1a8796992e918d12a009b9d)

mon: return instead of exit(3) via preforker

This lets us run all the locally-scoped dtors so that leak checking will
work.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3c5706163b72245768958155d767abf561e6d96d)

os/LevelDBStore: add perfcounters

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7802292e0a49be607d7ba139b44d5ea1f98e07e6)

mon: make compaction bounds overlap

When we trim items N to M, compact over range (N-1) to M so that the
items in the queue will share bounds and get merged. There is no harm in
compacting over a larger range here when the lower bound is a key that
doesn't exist anyway.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a47ca583980523ee0108774b466718b303bd3f46)

os/LevelDBStore: merge adjacent ranges in compactionqueue

If we get behind and multiple adjacent ranges end up in the queue, merge
them so that we fire off compaction on larger ranges.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f628dd0e4a5ace079568773edfab29d9f764d4f0)
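
A rough sketch of the merging idea, assuming the queue holds (start, end) pairs and that two ranges count as adjacent when one ends exactly where the next begins. Integer keys are used here for readability; the function name is hypothetical.

```python
def merge_adjacent(queue):
    """Merge consecutive (start, end) ranges that share a bound."""
    merged = []
    for start, end in queue:
        if merged and merged[-1][1] == start:
            # Extend the previous range instead of queueing a new one.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

This is why "mon: make compaction bounds overlap" compacts (N-1) to M: trimming 10..20 and then 21..30 queues (9, 20) and (20, 30), which share the bound 20 and collapse into (9, 30).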

mon: compact trimmed range, not entire prefix

This will reduce the work that leveldb is asked to do by only triggering
compaction of the keys that were just trimmed.

We may want to further reduce the work by compacting less frequently, but
this is at least a step in that direction.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6da4b20ca53fc8161485c8a99a6b333e23ace30e)

mon/MonitorDBStore: allow compaction of ranges

Allow a transaction to describe the compaction of a range of keys. Do this
in a backward compatible way, such that older code will interpret the
compaction of a prefix + range as compaction of the entire prefix. This
allows us to avoid introducing any new feature bits.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ab09f1e5c1305a64482ebbb5a6156a0bb12a63a4)

Conflicts:

src/mon/MonitorDBStore.h

os/LevelDBStore: allow compaction of key ranges

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e20c9a3f79ccfeb816ed634ca25de29fc5975ea8)

os/LevelDBStore: do compact_prefix() work asynchronously

We generally do not want to block while compacting a range of leveldb.
Push the blocking+waiting off to a separate thread. (leveldb will do what
it can to avoid blocking internally; no reason for us to wait explicitly.)

This addresses part of #5176.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4af917d4478ec07734a69447420280880d775fa2)

qa: rsync test: exclude /usr/local

Some plana have non-world-readable crap in /usr/local/samba. Avoid
/usr/local entirely for that and any similar landmines.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 82211f2197241c4f3d3135fd5d7f0aa776eaeeb6)

mon: fix uninitialized fields in MMonHealth

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d7e2ab1451e284cd4273cca47eec75e1d323f113)

PGLog: only add entry to caller_ops in add() if reqid_is_indexed()

Fixes: #5216
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

PG: don't write out pg map epoch every handle_activate_map

We don't actually need to write out the pg map epoch on every
activate_map as long as:
a) the osd does not trim past the oldest pg map persisted
b) the pg does update the persisted map epoch from time
to time.

To that end, we now keep a reference to the last map persisted.
The OSD already does not trim past the oldest live OSDMapRef.
Second, handle_activate_map will trim if the difference between
the current map and the last_persisted_map is large enough.

Fixes: #4731
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 2c5a9f0e178843e7ed514708bab137def840ab89)

Conflicts:

src/common/config_opts.h
src/osd/PG.cc
- last_persisted_osdmap_ref gets set in the non-static
PG::write_info

upstart: handle upper case in cluster name and id

Signed-off-by: Alexandre Marangone <alexandre.marangone@inktank.com>
(cherry picked from commit 851619ab6645967e5d7659d9b0eea63d5c402b15)

OSDMonitor: skip new pools in update_pools_status() and get_pools_health()

New pools won't be full. mon->pgmon()->pg_map.pg_pool_sum[poolid] will
implicitly create an entry for poolid causing register_new_pgs() to assume that
the newly created pgs in the new pool are in fact a result of a split
preventing MOSDPGCreate messages from being sent out.

Fixes: #4813
Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0289c445be0269157fa46bbf187c92639a13db46)

rgw: only append prefetched data if reading from head

Fixes: #5209
Backport: bobtail, cuttlefish
If the head object wrongfully contains data, but according to the
manifest we don't read from the head, we shouldn't copy the prefetched
data. Also fix the length calculation for that data.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit c5fc52ae0fc851444226abd54a202af227d7cf17)

rgw: don't copy object idtag when copying object

Fixes: #5204
When copying object we ended up also copying the original
object idtag which overrode the newly generated one. When
refcount put is called with the wrong idtag the count
doesn't go down.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit b1312f94edc016e604f1d05ccfe2c788677f51d1)

debian: sync up postinst and prerm with latest

- do not use invoke-rc.d for upstart
- do not stop daemons on upgrade
- misc other cleanups

This corresponds to the state of master as of cf9aa7a.

Signed-off-by: Sage Weil <sage@inktank.com>

mon: Monitor: backup monmap using all ceph features instead of quorum's

When a monitor is freshly created and for some reason its initial sync is
aborted, it will end up with an incorrect backup monmap. This monmap is
incorrect in the sense that it will not contain the monitor's names as
it will expect on the next run.

This results from our using the quorum features to encode the monmap
when backing it up, instead of CEPH_FEATURES_ALL.

Fixes: #5203
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 626de387e617db457d6d431c16327c275b0e8a34)

osd: do not assume head obc object exists when getting snapdir

For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists. This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffIterateStress
test when thrashing OSDs.

Fixes: #5183
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 29e4e7e316fe3f3028e6930bb5987cfe3a5e59ab)

osd: initialize new_state field when we use it

If we use operator[] on a new int field its value is undefined; avoid
reading it or using |= et al until we initialize it.

Fixes: #4967
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 50ac8917f175d1b107c18ecb025af1a7b103d634)

HashIndex: sync top directory during start_split,merge,col_split

Otherwise, the links might be ordered after the in progress
operation tag write. We need the in progress operation tag to
correctly recover from an interrupted merge, split, or col_split.

Fixes: #5180
Backport: cuttlefish, bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bca9c38ef5187c7a97916970a7fa73b342755ac)

mon: Paxos: get rid of the 'prepare_bootstrap()' mechanism

We don't need it after all. If we are in the middle of some proposal,
then we guarantee that said proposal is likely to be retried. If we
haven't yet proposed, then it's forever more likely that a client will
eventually retry the message that triggered this proposal.

Basically, this mechanism attempted to fix a non-problem, and was in
fact triggering some unforeseen issues that would have required increasing
the code complexity for no good reason.

Fixes: #5102
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit e15d29094503f279d444eda246fc45c09f5535c9)

mon: Paxos: finish queued proposals instead of clearing the list

By finishing these Contexts, we make sure the Contexts they enclose (to be
called once the proposal goes through) will behave as they were initially
planned:  for instance, a C_Command() may retry the command if a -EAGAIN
is passed to 'finish_contexts', while a C_Trimmed() will simply set
'going_to_trim' to false.

This aims at fixing at least a bug in which Paxos will stop trimming if an
election is triggered while a trim is queued but not yet finished.  Such
happens because it is the C_Trimmed() context that is responsible for
resetting 'going_to_trim' back to false.  By clearing all the contexts on
the proposal list instead of finishing them, we stay forever unable to
trim Paxos again as 'going_to_trim' will stay True till the end of time as
we know it.

Fixes: #4895
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 586e8c2075f721456fbd40f738dab8ccfa657aa8)
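
The difference between clearing and finishing the queued contexts can be shown with a small model. The `Trimmer` class is hypothetical; its callback is a stand-in for C_Trimmed, not the real implementation.

```python
import errno

class Trimmer(object):
    def __init__(self):
        self.going_to_trim = False
        self.proposals = []   # queued callbacks, each taking a return code

    def _trimmed(self, r):
        # Stand-in for C_Trimmed: resets the flag whenever it is finished.
        self.going_to_trim = False

    def queue_trim(self):
        self.going_to_trim = True
        self.proposals.append(self._trimmed)

    def clear_proposals(self):
        # Old behavior: callbacks are dropped without ever running.
        self.proposals = []

    def finish_proposals(self, r):
        # New behavior: run every queued callback with the given code.
        pending, self.proposals = self.proposals, []
        for cb in pending:
            cb(r)
```

After `queue_trim()` followed by `clear_proposals()`, `going_to_trim` stays True and no further trim can be queued. After `finish_proposals(-errno.EAGAIN)` it is False again.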

mon: Paxos: finish_proposal() when we're finished recovering

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 2ff23fe784245f3b86bc98e0434b21a5318e0a7b)

Merge branch 'wip_scrub_tphandle' into cuttlefish

Fixes: #5159
Reviewed-by: Sage Weil <sage@inktank.com>

PG: ping tphandle during omap loop as well

Signed-off-by: Samuel Just <sam.just@inktank.com>

PG: reset timeout in _scan_list for each object, read chunk

Signed-off-by: Samuel Just <sam.just@inktank.com>

OSD,PG: pass tphandle down to _scan_list

Signed-off-by: Samuel Just <sam.just@inktank.com>

rgw: iterate usage entries from correct entry

Fixes: #5152
When iterating through usage entries, and when user id was
provided, we started at the user's first entry and not from
the entry indexed by the request start time.
This commit fixes the issue.

Backport: bobtail

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 8b3a04dec8be13559716667d4b16cde9e9543feb)

sysvinit: fix enumeration of local daemons when specifying type only

- prepend $local to the $allconf list at the top
- remove $local special case for all case
- fix the type prefix checks to explicitly check for prefixes

Fugly bash, but works!

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit c80c6a032c8112eab4f80a01ea18e1fa2c7aa6ed)

sysvinit: fix osd weight calculation on remote hosts

We need to do df on the remote host, not locally.

Similarly, the ceph command uses the osd key, which exists remotely; run it there.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d81d0ea5c442699570bd93a90bea0d97a288a1e9)

sysvinit: use known hostname $host instead of (incorrectly) recalculating

We would need to do hostname -s on the remote node, not the local one.
But we already have $host; use it!

Reported-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit caa15a34cb5d918c0c8b052cd012ec8a12fca150)

mon: be a bit more verbose about osd mark down events

Put these in the cluster log; they are interesting.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 87767fb1fb9a52d11b11f0b641cebbd9998f089e)

PG: subset_last_update must be at least log.tail

Fixes: #5020
Backport: bobtail, cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 72bf5f4813c273210b5ced7f7793bc1bf813690c)

FileJournal: adjust write_pos prior to unlocking write_lock

In committed_thru, we use write_pos to reset the header.start value in cases
where seq is past the end of our journalq. It is therefore important that the
journalq be updated atomically with write_pos (that is, under the write_lock).

The call to align_bl() is moved into do_write in order to ensure that write_pos
is adjusted correctly prior to write_bl().

Also, we adjust pos at the end of write_bl() such that pos \in [get_top(),
header.max_size) after write_bl().

Fixes: #5020
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit eaf3abf3f9a7b13b81736aa558c9084a8f07fdbe)

mon: implement --extract-monmap <filename>

This will make for a simpler process for
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors-from-an-unhealthy-cluster

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c0268e27497a4d8228ef54da9d4ca12f3ac1f1bf)

librbd: make image creation defaults configurable

Programs using older versions of the image creation functions can't
set newer parameters like image format and fancier striping.

Setting these options lets them use all the new functionality without
being patched and recompiled to use e.g. rbd_create3().
This is particularly useful for things like qemu-img, which does not
know how to create format 2 images yet.

Refs: #5067
backport: cuttlefish, bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit aacc9adc4e9ca90bbe73ac153cc754a3a5b2c0a1)

rbd.py: fix stripe_unit() and stripe_count()

These matched older versions of the functions, but would segfault
using the current versions.

backport: cuttlefish, bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 53ee6f965e8f06c7256848210ad3c4f89d0cb5a0)

cls_rbd: make sure stripe_unit is not larger than object size

Test a few other cases too.

backport: cuttlefish, bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 810306a2a76eec1c232fd28ec9c351e827fa3031)

rgw: protect ops log socket formatter

Fixes: #4905
Ops log (through the unix domain socket) uses a formatter, which wasn't
protected.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit d48f1edb07a4d8727ac956f70e663c1b4e33e1dd)

Makefile: force char to be signed

On an armv7l build, we see errors like

warning: rgw/rgw_common.cc:626:16: comparison is always false due to limited range of data type [-Wtype-limits]

from code

char c1 = hex_to_num(*src++);
...
if (c1 < 0)

Force char to be signed (regardless of any weird architecture's default)
to avoid risk of this leading to misbehavior.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 769a16d6674122f3b537f03e17514ad974bf2a2f)
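
The effect of char signedness on that check can be imitated with ctypes. In this sketch `c_byte` plays the role of a signed char and `c_ubyte` an unsigned one; that hex_to_num() returns -1 on bad input is an assumption made for the example.

```python
import ctypes

# Assumption: hex_to_num() signals a non-hex character by returning -1.
error_code = -1

signed_c1 = ctypes.c_byte(error_code).value     # char is signed
unsigned_c1 = ctypes.c_ubyte(error_code).value  # char is unsigned: -1 becomes 255

print(signed_c1 < 0)    # True: the error is detected
print(unsigned_c1 < 0)  # False: the check can never fire
```

With an unsigned char the error value wraps to 255, so `c1 < 0` is always false, which is what the compiler warning is pointing at.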

debian: stop sysvinit on ceph.prerm

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2f193fb931ed09d921e6fa5a985ab87aa4874589)

ceph df: fix si units for 'global' stats

si_t expects bytes, but it was being given kilobytes.

Signed-off-by: Mike Kelly <pioto@pioto.org>
(cherry picked from commit 0c2b738d8d07994fee4c73dd076ac9364a64bdb2)
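
A small sketch of why the unit mismatch matters, using a hypothetical `si` formatter that expects bytes. It assumes 1024-based scaling, which may not match si_t exactly.

```python
def si(nbytes):
    """Format a byte count with k/M/G/T/P suffixes (1024-based, assumed)."""
    units = ['', 'k', 'M', 'G', 'T', 'P']
    value = float(nbytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return '%.0f%s' % (value, unit)
        value /= 1024


kilobytes = 2 * 1024 * 1024      # 2 GB expressed in kB
print(si(kilobytes))             # wrong: kB passed where bytes are expected
print(si(kilobytes * 1024))      # right: converted to bytes first
```

Passing kilobytes makes every figure come out one suffix too small, here 2M instead of 2G.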

udev: install disk/by-partuuid rules

Wheezy's udev (175-7.2) has broken rules for the /dev/disk/by-partuuid/
symlinks that ceph-disk relies on. Install parallel rules that work. On
new udev, this is harmless; on older udev, this will make life better.

Fixes: #4865
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d8d7113c35b59902902d487738888567e3a6b933)

debian: make radosgw require matching version of librados2

...indirectly via ceph-common. We get bad behavior when they diverge, I
think because of libcommon.la being linked both statically and dynamically.

Fixes: #4997
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit 604c83ff18f9a40c4f44bc8483ef22ff41efc8ad)

mon: fix validation of mds ids in mon commands

Fixes: #4996
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5c305d63043762027323052b4bb3ae3063665c6f)

v0.61.2