Josh Durgin [Thu, 24 Oct 2013 15:42:48 +0000 (08:42 -0700)]
rgw: escape bucket and object names in StreamReadRequests
This fixes copy operations for objects that contain unsafe characters,
like a newline, which would return a 403 otherwise, since the GET to
the source rgw would be unable to verify the signature on a partially
valid bucket name.
Josh Durgin [Thu, 24 Oct 2013 15:37:25 +0000 (08:37 -0700)]
rgw: move url escaping to a common place
This is useful outside of the s3 interface. Rename url_escape() to
url_encode() for consistency with the existing common url_decode()
function. This is in preparation for the next commit, which needs
to escape url-unsafe characters in another place.
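A rough sketch of what the shared helper might look like (the name and the exact set of unescaped characters are assumptions, not the actual rgw code):
    #include <cctype>
    #include <cstdio>
    #include <string>

    // Percent-encode anything that is not URL-safe; '/' is left alone so
    // object names keep their path structure.
    std::string url_encode(const std::string& src)
    {
      std::string out;
      for (unsigned char c : src) {
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~' || c == '/') {
          out.push_back(c);
        } else {
          char buf[4];
          snprintf(buf, sizeof(buf), "%%%02X", c);
          out.append(buf);
        }
      }
      return out;
    }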
Josh Durgin [Thu, 24 Oct 2013 15:34:24 +0000 (08:34 -0700)]
rgw: update metadata log list to match data log list
Send the last marker and whether the log is truncated in the same format
as the data log list, so clients don't have more needless complexity
handling the difference. Keep bucket index logs the same, since they
contain the marker already, and are not used in exactly the same way
metadata and data logs are.
Josh Durgin [Thu, 24 Oct 2013 15:26:19 +0000 (08:26 -0700)]
rgw: include marker and truncated flag in data log list api
Consumers of this api need to know their position in the log. It's
readily available when fetching the log, so return it. Without the
marker in this call, a client could not easily or efficiently figure
out its position in the log, since it would require getting the global
last marker in the log, and then reading all the log entries.
This would be slow for large logs, and would be subject to races that
would cause potentially very expensive duplicate work.
Returning this atomically while fetching the log entries simplifies
all of this.
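To illustrate why returning the marker with each batch matters, here is a hypothetical consumer loop; the result struct, fetch call and handler are placeholders, not the radosgw-agent code:
    #include <string>
    #include <vector>

    struct log_list_result {
      std::vector<std::string> entries;
      std::string marker;   // position of the last entry returned
      bool truncated;       // true if more entries remain
    };

    // Stand-ins for the actual REST call and per-entry handling.
    log_list_result fetch_data_log(const std::string& from_marker);
    void process(const std::string& entry);

    void consume_all()
    {
      std::string pos;      // empty marker == start of the log
      bool more = true;
      while (more) {
        log_list_result r = fetch_data_log(pos);
        for (const auto& e : r.entries)
          process(e);
        pos = r.marker;     // resume point returned atomically with the batch
        more = r.truncated;
      }
    }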
Josh Durgin [Thu, 24 Oct 2013 15:18:19 +0000 (08:18 -0700)]
cls_log: always return final marker from log_list
There's no reason to restrict returning the marker to the case where
less than the whole log is returned, since there's already a truncated
flag to tell the client what happened.
Giving the client the last marker makes it easy to consume when the
log entries do not contain their own marker. If the last marker is not
returned, the client cannot get the last marker without racing with
updates to the log.
Yan, Zheng [Thu, 10 Oct 2013 02:35:48 +0000 (10:35 +0800)]
mds: fix infinite loop of MDCache::populate_mydir().
Make MDCache::populate_mydir() only fetch bare-bone stray dirs.
After all stray dirs are populated, call MDCache::scan_stray_dir(),
which fetches the incomplete stray dirs.
Fixes: #4405 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 007f06ec174d4ee5cfb578c8b3f1c96b2bb0c238)
Yehuda Sadeh [Tue, 15 Oct 2013 17:20:48 +0000 (10:20 -0700)]
rgw: fix authenticated users acl group check
Fixes: #6553
Backport: bobtail, cuttlefish, dumpling
The authenticated users group ACL bit was not working correctly: the
check to test whether a user is anonymous was wrong.
osdc/ObjectCacher: finish contexts after dropping object reference
The context to finish can be of class C_Client_PutInode, which may drop
the inode's last reference. So we should first drop the object's
reference, then finish the contexts.
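A minimal sketch of the ordering with simplified types (not the actual ObjectCacher code):
    #include <functional>
    #include <memory>
    #include <vector>

    struct Object { /* stand-in for the cached object */ };
    using ObjectRef = std::shared_ptr<Object>;

    // Running a completion may drop an inode's last reference, so the
    // object's own reference is released before the contexts are finished.
    void finish_waiters(ObjectRef obj,
                        std::vector<std::function<void(int)>> waiters, int r)
    {
      obj.reset();               // drop the object reference first
      for (auto& fin : waiters)  // then run the completion contexts
        fin(r);
    }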
Sandon Van Ness [Tue, 8 Oct 2013 19:08:08 +0000 (12:08 -0700)]
Go back to $PWD in fsstress.sh if compiling from source.
Although fsstress was being called with a static path, the directory
it was writing to was the current directory, so cd'ing to the source
directory made in /tmp and then removing it later left it unable to
write its files in a non-existent dir.
This change gets the current path first and cd's back into it after
it is done compiling fsstress.
Issue #6479.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Alfredo Deza <alfredo.deza@inktank.com>
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 139a714e13aa3c7f42091270b55dde8a17b3c4b8)
mon: OSDMonitor: do not write full_latest during trim
On commit 81983bab we patched OSDMonitor::update_from_paxos() such that we
write the latest full map version to 'full_latest' each time the latest
full map was built from the incremental versions.
This change, however, clashed with OSDMonitor::encode_trim_extra(), which
also wrote to 'full_latest' on each trim, writing instead the version of
the *oldest* full map. This duality of behaviors could lead the store
to an inconsistent state across the monitors (although there's no sign of
it actually causing any issues besides rebuilding already existing full
maps on some monitors).
We now stop OSDMonitor::encode_trim_extra() from writing to 'full_latest'.
This function will still write out the oldest full map it has in the store,
but it will no longer write to full_latest, instead leaving it up to
OSDMonitor::update_from_paxos() to figure it out -- and it already does.
Fixes: #6378
Backport: dumpling
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bd0f29a2c28cca496ec830eac932477ebf3182ba)
We now put CEPH_ARGS in the actual args we parse in python, which are passed
to rados piecemeal later. This lets you put things like --id ... in there
that need to be parsed before librados is initialized.
(cherry picked from commit 97f462be4829f0167ed3d65e6694dfc16f1f3243)
David Zafman [Mon, 9 Sep 2013 20:01:12 +0000 (13:01 -0700)]
crushtool: do not dump core with non-unique bucket IDs
Return -EEXIST on duplicate ID
BUG FIX: crush_add_bucket() mixes error returns and IDs
Add optional argument to return generated ID
Fixes: #6246 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8c76f3a0f9cf100ea2c941dc2b61c470aa5033d7)
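A hedged sketch of the interface shape described above, with illustrative names rather than the exact CRUSH API:
    #include <cerrno>
    #include <map>

    // Errors go through the return value; the generated id comes back through
    // an optional output parameter, instead of overloading one int for both.
    int add_bucket(std::map<int, int>& buckets, int requested_id, int* assigned_id)
    {
      int id = requested_id;
      if (id == 0) {
        id = -1;                 // auto-assign the next free (negative) id
        while (buckets.count(id))
          --id;
      } else if (buckets.count(id)) {
        return -EEXIST;          // duplicate id is an error, never an id
      }
      buckets[id] = 0;           // placeholder payload
      if (assigned_id)
        *assigned_id = id;
      return 0;
    }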
perfglue/heap_profiler.cc: expect args as first element on cmd vector
We used to pass 'heap' as the first element of the cmd vector when
handling commands. We haven't been doing so for a while now, so we
needed to fix this.
Not expecting 'heap' also makes sense, considering that what we need to
know when we reach this function is which command we should handle; we
should not care what name the caller used to invoke us.
Fixes: #6176
Backport: dumpling
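A minimal sketch of the resulting dispatch; the sub-command names are illustrative:
    #include <ostream>
    #include <string>
    #include <vector>

    // The command vector now starts directly with the sub-command, so we
    // dispatch on cmd[0] instead of assuming cmd[0] == "heap".
    void handle_heap_command(const std::vector<std::string>& cmd, std::ostream& out)
    {
      if (cmd.empty()) {
        out << "missing heap command\n";
        return;
      }
      const std::string& sub = cmd[0];   // previously cmd[1]
      if (sub == "dump")
        out << "dumping heap profile\n";
      else if (sub == "start_profiler")
        out << "starting profiler\n";
      else
        out << "unknown heap command: " << sub << "\n";
    }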
We take different code paths in copy_obj; make sure we close the handle
when we exit the function. Move the call to finish_get_obj() out of
copy_obj_data(), as we don't create the handle there; that should
make the code less confusing and less prone to errors.
Also, note that RGWRados::get_obj() also calls finish_get_obj(). For
everything to work in concert we need to pass a pointer to the handle
and not the handle itself. Therefore we needed to also change the call
to copy_obj_data().
mon: OSDMonitor: update latest_full while rebuilding full maps
Not doing so will make the monitor rebuild the osdmap full versions, even
though they may have been rebuilt before, every time the monitor starts.
This mostly happens when the cluster is left in an unhealthy state for
a long period of time and incremental versions build up. Even though we
build the full maps on update_from_paxos(), not updating 'full_latest'
leads to the situation initially described.
mon: OSDMonitor: smaller transactions when rebuilding full versions
Otherwise, for considerably sized rebuilds, the monitor will not only
consume vast amounts of memory, but it will also have trouble committing
the transaction. Anyway, it's also a good idea to adjust transactions to
the granularity we want, and to be fair we care that each rebuilt full map
gets to disk, even if subsequent full maps don't (those can be rebuilt
later).
Fixes: #6175
Backport: dumpling
We get a buffer off the remote gateway which might not be
NULL-terminated. The JSON parser needs the buffer to be NULL-terminated,
even though we provide a buffer length, as it calls strlen().
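A sketch of the workaround; parse_json() here is a hypothetical stand-in for the parser entry point:
    #include <cstddef>
    #include <string>

    bool parse_json(const char* text);   // hypothetical parser entry point

    // The parser walks the buffer with strlen(), so a buffer received off the
    // wire must be NUL-terminated before it is handed over, even when its
    // length is already known.
    bool parse_reply(const char* data, size_t len)
    {
      std::string json(data, len);       // the copy guarantees a trailing '\0'
      return parse_json(json.c_str());
    }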
rgw: drain pending requests before completing write
Fixes: #6268
When doing aio write of objects (either regular or multipart parts) we
need to drain pending aio requests. Otherwise, if the gateway goes down,
the object might end up corrupted.
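A hedged sketch of the ordering, with an illustrative interface rather than the actual rgw aio throttle:
    // Wait for all in-flight aio writes to land before marking the object (or
    // multipart part) complete, so a crash cannot leave a "complete" object
    // with unwritten tails.
    struct PendingAio {
      virtual int drain() = 0;   // blocks until every queued aio has finished
      virtual ~PendingAio() = default;
    };

    int complete_write(PendingAio& pending)
    {
      int r = pending.drain();   // drain first
      if (r < 0)
        return r;                // propagate the aio error instead of completing
      // ... only now write the head / completion marker ...
      return 0;
    }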
Yehuda Sadeh [Thu, 22 Aug 2013 00:22:46 +0000 (17:22 -0700)]
rgw: OPTIONS request doesn't need to read object info
This is a bucket-only operation, so we shouldn't look at the
object. The object may not exist, and we might respond with a Not
Exists response, which is not what we want.
Samuel Just [Thu, 22 Aug 2013 18:19:37 +0000 (11:19 -0700)]
FileStore: add config option to disable the wbthrottle
Backport: dumpling Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3528100a53724e7ae20766344e467bf762a34163)
Samuel Just [Thu, 22 Aug 2013 18:19:52 +0000 (11:19 -0700)]
WBThrottle: use fdatasync instead of fsync
Backport: dumpling Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d571825080f0bff1ed3666e95e19b78a738ecfe8)
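The effect of the change, as a sketch:
    #include <unistd.h>

    // fdatasync(2) flushes the file data and only the metadata needed to
    // retrieve it, skipping the extra inode-metadata flush that fsync(2)
    // forces; that is sufficient for the write-back throttler.
    int flush_fd(int fd)
    {
      return ::fdatasync(fd);   // previously ::fsync(fd)
    }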
Samuel Just [Tue, 27 Aug 2013 06:19:45 +0000 (23:19 -0700)]
PGLog: move the log size check after the early return
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
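The shape of the reordering, simplified (names are illustrative, not the PGLog code):
    #include <cassert>
    #include <list>

    // On some STL implementations std::list::size() is O(n), so the
    // consistency assert is only evaluated once the cheap early-return path
    // has been ruled out.
    void check_and_trim(std::list<int>& log, size_t tracked_size, bool nothing_to_do)
    {
      if (nothing_to_do)
        return;                           // early return no longer pays for size()
      assert(log.size() == tracked_size); // potentially linear-time check
      // ... trimming would happen here ...
    }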
Yehuda Sadeh [Fri, 23 Aug 2013 22:39:20 +0000 (15:39 -0700)]
rgw: flush pending data when completing multipart part upload
Fixes: #6111
Backport: dumpling
When completing the part upload we need to flush any data that we
aggregated and didn't flush yet. With the earlier code we didn't have
to deal with it, as for multipart upload we didn't have any pending data.
What we do now is we call the regular atomic data completion
function that takes care of it.
When posting an object it is possible to provide a key
name that refers to the original filename; however, we
need to verify that in the end we don't end up with an
empty object name.
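A sketch of the check, assuming the usual ${filename} substitution; the names are illustrative:
    #include <string>

    // After substituting the uploaded file's name into the key template,
    // reject the request if the resulting object name is empty.
    bool valid_post_object_name(std::string key, const std::string& filename)
    {
      const std::string token = "${filename}";
      std::string::size_type pos = key.find(token);
      if (pos != std::string::npos)
        key.replace(pos, token.size(), filename);
      return !key.empty();
    }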
Sage Weil [Fri, 23 Aug 2013 00:46:45 +0000 (17:46 -0700)]
mon/MonClient: release pending outgoing messages on shutdown
This fixes a small memory leak when we have messages queued for the mon
when we shut down. It is harmless except for the valgrind leak check
noise that obscures real leaks.
Yehuda Sadeh [Thu, 29 Aug 2013 20:06:33 +0000 (13:06 -0700)]
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Moving back the watch initialization after the zone init,
as the zone info holds the control pool name. Since zone
init might need to create a new system object (that needs
to distribute cache), don't try to distribute cache if
watch is not yet initialized.
Sage Weil [Wed, 28 Aug 2013 16:50:11 +0000 (09:50 -0700)]
mon: discover mon addrs, names during election state too
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form a quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
(cherry picked from commit c24028570015cacf1d9e154ffad80bec06a61e7c)
Dan Mick [Fri, 23 Aug 2013 00:30:24 +0000 (17:30 -0700)]
ceph_rest_api.py: create own default for log_file
common/config thinks the default log_file for non-daemons should be "".
Override that so that the default is
/var/log/ceph/{cluster}-{name}.{pid}.log
since ceph-rest-api is more of a daemon than a client.
Sage Weil [Sat, 17 Aug 2013 00:59:11 +0000 (17:59 -0700)]
ceph-post-file: single command to upload a file to cephdrop
Use sftp to upload to a directory that only this user and ceph devs can
access.
Distribute an ssh key to connect to the account. This will let us revoke
the key in the future if we feel the need. Also distribute a known_hosts
file so that users have some confidence that they are connecting to the
real ceph drop account and not some third party.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Sat, 24 Aug 2013 21:04:09 +0000 (14:04 -0700)]
osd: install admin socket commands after signals
This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly. See #5924.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c5b5ce120a8ce9116be52874dbbcc39adec48b5c)
Sage Weil [Wed, 21 Aug 2013 05:39:09 +0000 (22:39 -0700)]
ceph-disk: partprobe after creating journal partition
At least one user reports that a partprobe is needed after creating the
journal partition. It is not clear why sgdisk is not doing it, but this
fixes ceph-disk for them, and should be harmless for other users.
Sage Weil [Tue, 20 Aug 2013 18:27:23 +0000 (11:27 -0700)]
mon/Paxos: always refresh after any store_state
If we store any new state, we need to refresh the services, even if we
are still in the midst of Paxos recovery. This is because the
subscription path will share any committed state even when paxos is
still recovering. This prevents a race like:
- we have maps 10..20
- we drop out of quorum
- we are elected leader, paxos recovery starts
- we get one LAST with committed states that trim maps 10..15
- we get a subscribe for map 10..20
- we crash because 10 is no longer on disk because the PaxosService
is out of sync with the on-disk state.
Fixes: #6045
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 981eda9f7787c83dc457f061452685f499e7dd27)
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 7e0848d8f88f156a05eef47a9f730b772b64fbf2)
Sage Weil [Tue, 20 Aug 2013 18:26:57 +0000 (11:26 -0700)]
mon/Paxos: cleanup: use do_refresh from handle_commit
This avoids duplicated code by using the helper created exactly for this
purpose.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit b9dee2285d9fe8533fa98c940d5af7b0b81f3d33)
Sage Weil [Fri, 16 Aug 2013 04:48:06 +0000 (21:48 -0700)]
osdc/ObjectCacher: do not merge rx buffers
We do not try to merge rx buffers currently. Make that explicit and
documented in the code that it is not supported. (Otherwise the
last_read_tid values will get lost and read results won't get applied
to the cache properly.)
Sage Weil [Fri, 16 Aug 2013 04:47:18 +0000 (21:47 -0700)]
osdc/ObjectCacher: match reads with their original rx buffers
Consider a sequence like:
1- start read on 100~200
100~200 state rx
2- truncate to 200
100~100 state rx
3- start read on 200~200
100~100 state rx
200~200 state rx
4- get 100~200 read result
This currently crashes on an assert when processing the second 200~200
bufferhead (it is too big). The
larger issue, though, is that we should not be looking at this data at
all; it has been truncated away.
Fix this by marking each rx buffer with the read request that is sent to
fill it, and only fill it from that read request. Then the first reply
will fill the first 100~100 extent but not touch the other extent; the
second read will do that.
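In simplified form (this is not the real BufferHead, just the idea):
    #include <cstdint>

    // Each rx buffer remembers which read request was sent to fill it, and a
    // reply is only applied to buffers carrying that request's tid.
    struct RxBuffer {
      uint64_t offset;
      uint64_t length;
      uint64_t last_read_tid;   // read request this buffer is waiting on
    };

    bool reply_fills(const RxBuffer& bh, uint64_t reply_tid)
    {
      return bh.last_read_tid == reply_tid;  // ignore data from older reads
    }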
Sage Weil [Thu, 22 Aug 2013 22:54:48 +0000 (15:54 -0700)]
mon/Paxos: fix another uncommitted value corner case
It is possible that we begin the paxos recovery with an uncommitted
value for, say, commit 100. During last/collect we discover 100 has been
committed already. But also, another node provides an uncommitted value
for 101 with the same pn. Currently, we refuse to learn it, because the
pn is not strictly > than our current uncommitted pn... even though it is
the next last_committed+1 value that we need.
There are two possible fixes here:
- make this a >= as we can accept newer values from the same pn.
- discard our uncommitted value metadata when we commit the value.
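The first option amounts to relaxing the comparison; roughly (illustrative, not the actual Paxos code):
    #include <cstdint>

    // Accept a peer's uncommitted value if it is the next value we need and
    // its proposal number is at least as new as the one we already hold
    // (>=, not strictly >).
    bool accept_uncommitted(uint64_t last_committed, uint64_t peer_version,
                            uint64_t peer_pn, uint64_t our_uncommitted_pn)
    {
      return peer_version == last_committed + 1 &&
             peer_pn >= our_uncommitted_pn;   // was: peer_pn > our_uncommitted_pn
    }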
Yehuda Sadeh [Mon, 19 Aug 2013 15:40:16 +0000 (08:40 -0700)]
rgw: change cache / watch-notify init sequence
Fixes: #6046
We were initializing the watch-notify (through the cache
init) before reading the zone info which was much too
early, as we didn't have the control pool name yet. Now
simplifying init/cleanup a bit, cache doesn't call watch/notify
init and cleanup directly, but rather states its need
through a virtual callback.
Alexandre Oliva [Thu, 22 Aug 2013 06:40:22 +0000 (03:40 -0300)]
enable mds rejoin with active inodes' old parent xattrs
When the parent xattrs of active inodes that the mds attempts to open
during rejoin lack pool info (struct_v < 5), this field will be filled
in with -1, causing the mds to retry fetching a backtrace with a pool
number that matches the expected value, which fails and causes the
err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
with pool -1, and so keeps on bouncing between the two retry cases
forever.
This patch arranges for the mds to go along with pool -1 instead of
insisting that it be refetched, enabling it to complete recovery
instead of eating cpu, network bandwidth and metadata osd's resources
like there's no tomorrow, in what AFAICT is an infinite and very busy
loop.
This is not a new problem: I've had it even before upgrading from
Cuttlefish to Dumpling, I'd just never managed to track it down, and
force-unmounting the filesystem and then restarting the mds was an
easier (if inconvenient) work-around, particularly because it always
hit when the filesystem was under active, heavy-ish use (or there
wouldn't be much reason for caps recovery ;-)
There are two issues not addressed in this patch, however. One is
that nothing seems to proactively update the parent xattr when it is
found to be outdated, so it remains out of date forever. Not even
renaming top-level directories causes the xattrs to be recursively
rewritten. AFAICT that's a bug.
The other is that inodes that don't have a parent xattr (created by
even older versions of ceph) are reported as non-existing in the mds
rejoin message, because the absence of the parent xattr is signaled as
a missing inode ("failed to reconnect caps for missing inodes"). I
suppose this may cause more serious recovery problems.
I suppose a global pass over the filesystem tree updating parent
xattrs that are out-of-date would be desirable, if we find any parent
xattrs still lacking current information; it might make sense to
activate it as a background thread from the backtrace decoding
function, when it finds a parent xattr that's too out-of-date, or as a
separate client (ceph-fsck?).
Backport: dumpling, cuttlefish Signed-off-by: Alexandre Oliva <oliva@gnu.org> Reviewed-by: Zheng, Yan <zheng.z.yan@intel.com>
(cherry picked from commit 617dc36d477fd83b2d45034fe6311413aa1866df)
David Disseldorp [Mon, 29 Jul 2013 15:05:44 +0000 (17:05 +0200)]
mds: remove waiting lock before merging with neighbours
CephFS currently deadlocks under CTDB's ping_pong POSIX locking test
when run concurrently on multiple nodes.
The deadlock is caused by failed removal of a waiting_locks entry when
the waiting lock is merged with an existing lock. For example, a waiting
4116@1:1 lock entry is merged with an existing 4116@0:1 held lock to
become a 4116@0:2 held lock; however, the now-handled 4116@1:1
waiting_locks entry remains.
When handling a lock request, the MDS calls adjust_locks() to merge
the new lock with available neighbours. If the new lock is merged,
then the waiting_locks entry is not located in the subsequent
remove_waiting() call because adjust_locks changed the new lock to
include the old locks.
This fix ensures that the waiting_locks entry is removed prior to
modification during merge.
Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 476e4902907dfadb3709ba820453299ececf990b)
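A hedged sketch of that ordering, with illustrative types rather than the actual MDS flock code:
    #include <cstdint>
    #include <map>

    struct LockRange { uint64_t start; uint64_t length; };

    void merge_with_neighbours(LockRange& l);   // may rewrite l.start / l.length

    // The waiting_locks entry is erased while its key still matches the
    // request, before merging can change the lock's range.
    void grant_lock(std::multimap<uint64_t, LockRange>& waiting_locks,
                    uint64_t wait_key, LockRange l)
    {
      waiting_locks.erase(wait_key);   // remove the waiting entry first...
      merge_with_neighbours(l);        // ...then merge with held neighbours
    }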