git.apps.os.sepia.ceph.com Git - ceph.git/log
Sage Weil [Tue, 22 Jan 2013 04:00:26 +0000 (20:00 -0800)]
osd: target transaction size 300 -> 30

Small transactions make pg removal nicer to the op queue.  It also slows
down PG deletion a bit, which may exacerbate the PG resurrection case
until #3884 is addressed.

At least one user reported this fixed an osd that kept failing due to
an internal heartbeat failure.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1233e8617098766c95100aa9a6a07db1a688e290)

Sage Weil [Tue, 22 Jan 2013 03:55:26 +0000 (19:55 -0800)]
os/FileStore: allow filestore_queue_max_{ops,bytes} to be adjusted at runtime

The 'committing' ones too.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cfe4b8519363f92f84f724a812aa41257402865f)

Sage Weil [Sun, 20 Jan 2013 06:06:27 +0000 (22:06 -0800)]
osd: make osd_max_backfills dynamically adjustable

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 101955a6b8bfdf91f4229f4ecb5d5b3da096e160)

Sage Weil [Sun, 20 Jan 2013 02:28:35 +0000 (18:28 -0800)]
osd: make OSD a config observer

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9230c863b3dc2bdda12c23202682a84c48f070a1)

Conflicts:

src/osd/OSD.cc

Dan Mick [Tue, 8 Jan 2013 19:21:22 +0000 (11:21 -0800)]
librbd: Allow get_lock_info to fail

If the lock class isn't present, EOPNOTSUPP is returned for lock calls
on newer OSDs, but sadly EIO on older; we need to treat both as
acceptable failures for RBD images.  rados lock list will still fail.

Fixes #3744.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4483285c9fb16f09986e2e48b855cd3db869e33c)

Sage Weil [Fri, 4 Jan 2013 21:00:56 +0000 (13:00 -0800)]
osd: drop newlines from event descriptions

These produce extra newlines in the log.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 9a1f574283804faa6dbba9165a40558e1a6a1f13)

Samuel Just [Fri, 18 Jan 2013 22:35:51 +0000 (14:35 -0800)]
OSD: do deep_scrub for repair

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 0cb760f31b0cb26f022fe8b9341e41cd5351afac)

Samuel Just [Mon, 14 Jan 2013 20:52:04 +0000 (12:52 -0800)]
ReplicatedPG: ignore snap link info in scrub if nlinks==0

nlinks==0 implies that the replica did not send snap link information.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 70c3512037596a42ba6eb5eb7f96238843095db9)

Sage Weil [Fri, 11 Jan 2013 20:25:22 +0000 (12:25 -0800)]
osd/PG: fix osd id in error message on snap collection errors

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 381e25870f26fad144ecc2fb99710498e3a7a1d4)

Sage Weil [Thu, 10 Jan 2013 06:34:12 +0000 (22:34 -0800)]
osd/ReplicatedPG: validate ino when scrubbing snap collections

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 665577a88b98390b9db0f9991836d10ebdd8f4cf)

Samuel Just [Thu, 10 Jan 2013 00:41:40 +0000 (16:41 -0800)]
ReplicatedPG: compare nlinks to snapcolls

nlinks gives us the number of hardlinks to the object.
nlinks should be 1 + snapcolls.size().  This will allow
us to detect links which remain in an erroneous snap
collection.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e65ea70ea64025fbb0709ee8596bb2878be0bbdc)
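
The invariant described above can be sketched in a few lines; this is illustrative C++ with hypothetical names, not the actual ScrubMap code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative check: an object should have one hardlink for head plus
// one per snap collection it appears in. nlinks==0 means the replica
// sent no link info (see the previous commit), so the check is skipped.
bool snap_links_consistent(std::size_t nlinks,
                           const std::vector<std::string>& snapcolls) {
  if (nlinks == 0)
    return true;  // replica did not report link counts; nothing to verify
  return nlinks == 1 + snapcolls.size();
}
```

A failed check here is exactly the "link which remains in an erroneous snap collection" case the commit wants to detect.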

Samuel Just [Thu, 10 Jan 2013 23:35:10 +0000 (15:35 -0800)]
ReplicatedPG/PG: check snap collections during _scan_list

During _scan_list check the snapcollections corresponding to the
object_info attr on the object.  Report inconsistencies during
scrub_finalize.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 57352351bb86e0ae9f64f9ba0d460c532d882de6)

Samuel Just [Wed, 9 Jan 2013 19:53:52 +0000 (11:53 -0800)]
osd_types: add nlink and snapcolls fields to ScrubMap::object

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit b85687475fa2ec74e5429d92ee64eda2051a256c)

Samuel Just [Fri, 4 Jan 2013 04:16:50 +0000 (20:16 -0800)]
PG: move auth replica selection to helper in scrub

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 39bc65492af1bf1da481a8ea0a70fe7d0b4b17a3)

Sage Weil [Tue, 15 Jan 2013 02:23:52 +0000 (18:23 -0800)]
mon: note scrub errors in health summary

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8e33a8b9e1fef757bbd901d55893e9b84ce6f3fc)

Sage Weil [Tue, 15 Jan 2013 02:31:06 +0000 (18:31 -0800)]
osd: fix rescrub after repair

We were rescrubbing if INCONSISTENT is set, but that is now persistent.
Add a new scrub_after_recovery flag that is reset on each peering interval
and set that when repair encounters errors.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a586966a3cfb10b5ffec0e9140053a7e4ff105d2)

Sage Weil [Tue, 15 Jan 2013 02:22:02 +0000 (18:22 -0800)]
osd: note must_scrub* flags in PG operator<<

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d56af797f996ac92bf4e0886d416fd358a2aa08e)

Sage Weil [Tue, 15 Jan 2013 02:21:46 +0000 (18:21 -0800)]
osd: base INCONSISTENT pg state on persistent scrub errors

This makes the state persistent across PG peering and OSD restarts.

This has the side-effect that, on recovery, we rescrub any PGs marked
inconsistent.  This is new behavior!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2baf1253eed630a7c4ae4cb43aab6475efd82425)

Sage Weil [Tue, 15 Jan 2013 02:20:29 +0000 (18:20 -0800)]
osd: fix scrub scheduling for 0.0

The initial value for pair<utime_t,pg_t> can match pg 0.0, preventing it
from being manually scrubbed.  Fix!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 26a63df97b2a12fd1a7c1e3cc9ccd34ca2ef9834)

Sage Weil [Mon, 14 Jan 2013 07:03:01 +0000 (23:03 -0800)]
osd: note last_clean_scrub_stamp, last_scrub_errors

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 389bed5d338cf32ab14c9fc2abbc7bcc386b8a28)

Sage Weil [Mon, 14 Jan 2013 06:59:39 +0000 (22:59 -0800)]
osd: add num_scrub_errors to object_stat_t

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2475066c3247774a2ad048a2e32968e47da1b0f5)

Sage Weil [Mon, 14 Jan 2013 06:43:35 +0000 (22:43 -0800)]
osd: add last_clean_scrub_stamp to pg_stat_t, pg_history_t

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d738328488de831bf090f23e3fa6d25f6fa819df)

Sage Weil [Mon, 14 Jan 2013 06:56:14 +0000 (22:56 -0800)]
osd: fix object_stat_sum_t dump signedness

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6f6a41937f1bd05260a8d70b4c4a58ecadb34a2f)

Sage Weil [Mon, 14 Jan 2013 06:04:58 +0000 (22:04 -0800)]
osd: change scrub min/max thresholds

The previous 'osd scrub min interval' was mostly meaningless and useless.
Meanwhile, the 'osd scrub max interval' would only trigger a scrub if the
load was sufficiently low; if it was high, the PG might *never* scrub.

Instead, make the 'min' what the max used to be.  If it has been more than
this many seconds, and the load is low, scrub.  And add an additional
condition that if it has been more than the max threshold, scrub the PG
no matter what--regardless of the load.

Note that this does not change the default scrub interval for less-loaded
clusters, but it *does* change the meaning of existing config options.

Fixes: #3786
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 299548024acbf8123a4e488424c06e16365fba5a)

Conflicts:

PendingReleaseNotes
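
The new scheduling rule reads naturally as a small predicate; a minimal sketch with hypothetical names and intervals in seconds, not the actual PG::scrub_sched() code:

```cpp
#include <cassert>

// Sketch of the decision described above:
//  - past the max interval            -> scrub no matter what
//  - past the min interval + low load -> scrub
bool should_scrub(double since_last_scrub,
                  double scrub_min_interval,
                  double scrub_max_interval,
                  bool load_is_low) {
  if (since_last_scrub > scrub_max_interval)
    return true;                        // overdue: ignore the load
  if (since_last_scrub > scrub_min_interval && load_is_low)
    return true;                        // due, and the system is idle enough
  return false;
}
```

The unconditional max-interval branch is what removes the old failure mode where a persistently loaded PG might never scrub.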

Sage Weil [Mon, 14 Jan 2013 04:27:59 +0000 (20:27 -0800)]
osd/PG: remove useless osd_scrub_min_interval check

This was already a no-op: we don't call PG::scrub_sched() unless it has
been osd_scrub_max_interval seconds since we last scrubbed.  Unless we
explicitly requested it, in which case we don't want this check anyway.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 16d67c798b6f752a6e03084bafe861396b86baae)

Sage Weil [Mon, 14 Jan 2013 04:25:39 +0000 (20:25 -0800)]
osd: move scrub schedule random backoff to separate helper

Separate this from the load check, which will soon vary depending on the
PG.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a148120776d0930b265411332a60e93abfbf0423)

Sage Weil [Sat, 12 Jan 2013 17:18:38 +0000 (09:18 -0800)]
osd/PG: trigger scrub via scrub schedule, must_ flags

When a scrub is requested, flag it and move it to the front of the
scrub schedule instead of immediately queuing it.  This avoids
bypassing the scrub reservation framework, which can lead to a heavier
impact on performance.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 62ee6e099a8e4873287b54f9bba303ea9523d040)

Sage Weil [Sat, 12 Jan 2013 17:15:16 +0000 (09:15 -0800)]
osd/PG: introduce flags to indicate explicitly requested scrubs

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1441095d6babfacd781929e8a54ed2f8a4444467)

Sage Weil [Sat, 12 Jan 2013 17:14:01 +0000 (09:14 -0800)]
osd/PG: move scrub schedule registration into a helper

Simplifies callers, and will let us easily modify the decision of when
to schedule the PG for scrub.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 796907e2159371f84a16cbd35f6caa8ac868acf6)

Sage Weil [Fri, 18 Jan 2013 20:14:48 +0000 (12:14 -0800)]
os/FileStore: only flush inline if write is sufficiently large

Honor filestore_flush_min in the inline flush case.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 49726dcf973c38c7313ab78743b45ccc879671ea)

Sage Weil [Fri, 18 Jan 2013 20:14:40 +0000 (12:14 -0800)]
os/FileStore: fix compile when sync_file_range is missing;

If sync_file_range is not present, we always close inline, and flush
via fdatasync(2).

Fixes compile on ancient platforms like RHEL5.8.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 8ddb55d34c72e6df1023cf427cbd41f3f98da402)

Sage Weil [Fri, 18 Jan 2013 23:23:22 +0000 (15:23 -0800)]
osd: set pg removal transactions based on configurable

Use the osd_target_transaction_size knob, and gracefully tolerate bogus
values (e.g., <= 0).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5e00af406b89c9817e9a429f92a05ca9c29b19c3)
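
A guard of this shape is all "gracefully tolerate" needs; a hypothetical helper, not the actual OSD code:

```cpp
#include <cassert>

// Hypothetical guard in the spirit of the commit: if the configured
// target transaction size is zero or negative, fall back to a sane
// default rather than building degenerate removal transactions.
int effective_target_txn_size(int configured, int fallback) {
  return configured > 0 ? configured : fallback;
}
```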

Sage Weil [Fri, 18 Jan 2013 23:30:06 +0000 (15:30 -0800)]
osd: make pg removal thread more friendly

For a large PG these are saturating the filestore and journal queues.  Do
them synchronously to make them more friendly.  They don't need to be fast.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4712e984d3f62cdf51ea67da8197eed18a5983dd)

Sage Weil [Fri, 18 Jan 2013 23:27:24 +0000 (15:27 -0800)]
os: move apply_transactions() sync wrapper into ObjectStore

This has nothing to do with the backend implementation.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bc994045ad67fb70c7a0457b8cd29273dd5d1654)

Sage Weil [Fri, 18 Jan 2013 23:28:24 +0000 (15:28 -0800)]
os: add apply_transaction() variant that takes a sequencer

Also, move the convenience wrappers into the interface and funnel through
a single implementation.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f6c69c3f1ac35546b90315fff625993ba5cd8c07)

Sage Weil [Mon, 21 Jan 2013 00:11:10 +0000 (16:11 -0800)]
osd: calculate initial PG mapping from PG's osdmap

The initial values of up/acting need to be based on the PG's osdmap, not
the OSD's latest.  This can cause various confusion in
pg_interval_t::check_new_interval() when calling OSDMap methods due to the
up/acting OSDs not existing yet (for example).

Fixes: #3879
Reported-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Tested-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 17160843d0c523359d8fa934418ff2c1f7bffb25)

Sage Weil [Thu, 17 Jan 2013 23:01:35 +0000 (15:01 -0800)]
osdmap: make replica separate in default crush map configurable

Add 'osd crush chooseleaf type' option to control what the default
CRUSH rule separates replicas across.  Default to 1 (host), and set it
to 0 in vstart.sh.

Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit c236a51a8040508ee893e4c64b206e40f9459a62)
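
The new option would be set in ceph.conf; an illustrative fragment for a single-host test cluster (0 separates replicas across OSDs rather than hosts, as vstart.sh does):

```ini
[global]
    osd crush chooseleaf type = 0
```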

Sage Weil [Wed, 16 Jan 2013 22:09:53 +0000 (14:09 -0800)]
ceph: adjust crush tunables via 'ceph osd crush tunables <profile>'

Make it easy to adjust crush tunables.  Create profiles:

 legacy: the legacy values
 argonaut: the argonaut defaults, and what is supported.. legacy! (*)
 bobtail: best that bobtail supports
 optimal: the current optimal values
 default: the current default values

* In actuality, argonaut supports some of the tunables, but it doesn't
  say so via the feature bits.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 19ee23111585f15a39ee2907fa79e2db2bf523f0)

Sage Weil [Sat, 29 Dec 2012 01:20:43 +0000 (17:20 -0800)]
msg/Pipe: use state_closed atomic_t for _lookup_pipe

We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock.  Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 82f8bcddb5fa09913eb477ee26c71d6b4bb8d97c)

Sage Weil [Sun, 23 Dec 2012 21:43:15 +0000 (13:43 -0800)]
msgr: inject delays at inconvenient times

Exercise some rare races by injecting delays before taking locks
via the 'ms inject internal delays' option.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a5d692a7b9b4bec2c27993ca37aa3fec4065292b)

Sage Weil [Sun, 23 Dec 2012 17:22:18 +0000 (09:22 -0800)]
msgr: fix race on Pipe removal from hash

When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry.  The Pipe in this case will
have STATE_CLOSED.  Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.

This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.

See bug #3675.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e99b4a307b4427945a4eb5ec50e65d6239af4337)

Sage Weil [Sun, 23 Dec 2012 17:19:05 +0000 (09:19 -0800)]
msgr: don't queue message on closed pipe

If we have a con that refs a pipe but it is closed, don't use it.  If
the ref is still there, it is only because we are racing with fault()
and it is about to be (or just was) detached.  Either way, we should
not use the closed pipe.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6339c5d43974f4b495f15d199e01a141e74235f5)

Sage Weil [Sun, 23 Dec 2012 05:24:52 +0000 (21:24 -0800)]
msgr: atomically queue first message with connect_rank

Atomically queue the first message on the new pipe, without dropping
and retaking pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7bf0b0854d1f2706a3a2302bcbf92dd5c8c888ef)

Samuel Just [Thu, 10 Jan 2013 19:06:02 +0000 (11:06 -0800)]
config_opts.h: default osd_recovery_delay_start to 0

This setting was intended to prevent recovery from overwhelming peering traffic
by delaying the recovery_wq until osd_recovery_delay_start seconds after pgs
stop being added to it.  This should be less necessary now that recovery
messages are sent with strictly lower priority than peering messages.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
(cherry picked from commit 44625d4460f61effe2d63d8280752f10f159e7b4)

Sage Weil [Thu, 17 Jan 2013 05:19:18 +0000 (21:19 -0800)]
osdmaptool: more cli test fixes

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b0162fab3d927544885f2b9609b9ab3dc4aaff74)

Sage Weil [Thu, 17 Jan 2013 05:10:26 +0000 (21:10 -0800)]
osdmaptool: fix cli test

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bd8765c918174aea606069124e43c480c809943)

Samuel Just [Wed, 16 Jan 2013 22:21:47 +0000 (14:21 -0800)]
osdmaptool: allow user to specify pool for test-map-object

Fixes: #3820
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
(cherry picked from commit 85eb8e382a26dfc53df36ae1a473185608b282aa)

David Zafman [Wed, 16 Jan 2013 20:41:16 +0000 (12:41 -0800)]
rados.cc: fix rmomapkey usage: val not needed

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <samuel.just@inktank.com>
(cherry picked from commit 625c3cb9b536a0cff7249b8181b7a4f09b1b4f4f)

Samuel Just [Wed, 16 Jan 2013 05:27:23 +0000 (21:27 -0800)]
librados.hpp: fix omap_get_vals and omap_get_keys comments

We list keys greater than start_after.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 3f0ad497b3c4a5e9bef61ecbae5558ae72d4ce8b)

Samuel Just [Wed, 16 Jan 2013 05:26:22 +0000 (21:26 -0800)]
rados.cc: use omap_get_vals_by_keys in getomapval

Fixes: #3811
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit cb5e2be418924cf8b2c6a6d265a7a0327f08d00a)

Samuel Just [Wed, 16 Jan 2013 05:24:50 +0000 (21:24 -0800)]
rados.cc: fix listomapvals usage: key,val are not needed

Fixes: #3812
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 44c45e520cc2e60c6c803bb245edb9330bff37e4)

Yehuda Sadeh [Wed, 16 Jan 2013 23:01:47 +0000 (15:01 -0800)]
rgw: copy object should not copy source acls

Fixes: #3802
Backport: argonaut, bobtail

When using the S3 api and x-amz-metadata-directive is
set to COPY we used to copy complete metadata of source
object. However, this shouldn't include the source ACLs.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 37dbf7d9df93dd0e92019be31eaa1a19dd9569c7)

Samuel Just [Fri, 11 Jan 2013 19:02:15 +0000 (11:02 -0800)]
OSD: only trim up to the oldest map still in use by a pg

map_cache.cached_lb() provides us with a lower bound across
all pgs for in-use osdmaps.  We cannot trim past this since
those maps are still in use.

backport: bobtail
Fixes: #3770
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 66eb93b83648b4561b77ee6aab5b484e6dba4771)

Sage Weil [Mon, 14 Jan 2013 16:15:02 +0000 (08:15 -0800)]
Revert "osdmap: spread replicas across hosts with default crush map"

This reverts commit 503917f0049d297218b1247dc0793980c39195b3.

This breaks vstart and teuthology configs.  A better fix is coming.

Joao Eduardo Luis [Thu, 10 Jan 2013 18:54:12 +0000 (18:54 +0000)]
mon: OSDMonitor: don't output to stdout in plain text if json is specified

Fixes: #3748
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 410906e04936c935903526f26fb7db16c412a711)

Sage Weil [Sat, 12 Jan 2013 01:23:22 +0000 (17:23 -0800)]
osdmap: spread replicas across hosts with default crush map

This is more often the case than not, and we don't have a good way to
magically know what size of cluster the user will be creating.  Better to
err on the side of doing the right thing for more people.

Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 7ea5d84fa3d0ed3db61eea7eb9fa8dbee53244b6)

Samuel Just [Thu, 10 Jan 2013 03:17:23 +0000 (19:17 -0800)]
ReplicatedPG: fix snapdir trimming

The previous logic was both complicated and not correct.  Consequently,
we have been tending to drop snapcollection links in some cases.  This
has resulted in clones incorrectly not being trimmed.  This patch
replaces the logic with something less efficient but hopefully a bit
clearer.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f42c37359d976d1fe90f2d3b877b9b0268adc0b)

Gary Lowell [Mon, 7 Jan 2013 21:33:30 +0000 (13:33 -0800)]
v0.56.1

Sage Weil [Sun, 6 Jan 2013 16:38:27 +0000 (08:38 -0800)]
msg/Pipe: prepare Message data for wire under pipe_lock

We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.

Related to #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d16ad9263d7b1d3c096f56c56e9631fae8509651)

Sage Weil [Sun, 6 Jan 2013 16:33:01 +0000 (08:33 -0800)]
msgr: update Message envelope in encode, not write_message

Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message().  This removes most modifications from
Pipe::write_message().

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 40706afc66f485b2bd40b2b4b1cd5377244f8758)

Sage Weil [Sun, 6 Jan 2013 16:25:40 +0000 (08:25 -0800)]
msg/Pipe: encode message inside pipe_lock

This modifies bufferlists in the Message struct, and it is possible
for multiple instances of the Pipe to get references on the Message;
make sure they don't modify those bufferlists concurrently.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4cfc4903c6fb130b6ac9105baf1f66fbda797f14)

Sage Weil [Sat, 5 Jan 2013 18:39:08 +0000 (10:39 -0800)]
msg/Pipe: associate sending msgs to con inside lock

Associate a sending message with the connection inside the pipe_lock.
This way, if a racing thread tries to steal these messages, it will
be sure to reset the con pointer *after* we do, so that the con
pointer is valid in encode_payload() (and later).

This may be part of #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a058f16113efa8f32eb5503d5443aa139754d479)

Sage Weil [Sat, 5 Jan 2013 17:29:50 +0000 (09:29 -0800)]
msg/Pipe: fix msg leak in requeue_sent()

The sent list owns a reference to each message.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2a1eb466d3f8e25ec8906b3ca6118a14c4e269d2)

Sage Weil [Mon, 7 Jan 2013 20:58:39 +0000 (12:58 -0800)]
osdc/Objecter: fix linger_ops iterator invalidation on pool deletion

The call to check_linger_pool_dne() may unregister the linger request,
invalidating the iterator.  To avoid this, increment the iterator at
the top of the loop.

This mirrors the fix in 4bf9078286d58c2cd4e85cb8b31411220a377092 for
regular non-linger ops.

Fixes: #3734
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 62586884afd56f2148205bdadc5a67037a750a9b)
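
The fix is the standard erase-safe iteration idiom for node-based containers; a generic sketch over a plain std::map, not the Objecter's actual linger_ops type:

```cpp
#include <cassert>
#include <map>

// Advance the iterator at the top of the loop, before any call that
// might erase the current element, so erasure never invalidates the
// iterator we continue with.
int erase_matching(std::map<int, int>& ops, int bad_pool) {
  int erased = 0;
  for (auto p = ops.begin(); p != ops.end(); ) {
    auto cur = p++;              // increment first, like the Objecter fix
    if (cur->second == bad_pool) {
      ops.erase(cur);            // safe: 'p' already points past 'cur'
      ++erased;
    }
  }
  return erased;
}
```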

Sage Weil [Sun, 6 Jan 2013 04:53:49 +0000 (20:53 -0800)]
os/FileJournal: include limits.h

Needed for IOV_MAX.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ce49968938ca3636f48fe543111aa219f36914d8)

Sage Weil [Sat, 5 Jan 2013 01:43:41 +0000 (17:43 -0800)]
osd: special case CALL op to not have RD bit effects

In commit 20496b8d2b2c3779a771695c6f778abbdb66d92a we treat a CALL as
different from a normal "read", but we did not adjust the behavior
determined by the RD bit in the op.  We tried to fix that in
91e941aef9f55425cc12204146f26d79c444cfae, but changing the op code breaks
compatibility, so that was reverted.

Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit.  (And fix one lingering user to use that
helper appropriately.)

Fixes: #3731
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 988a52173522e9a410ba975a4e8b7c25c7801123)

Sage Weil [Sat, 5 Jan 2013 04:46:48 +0000 (20:46 -0800)]
Revert "OSD: remove RD flag from CALL ops"

This reverts commit 91e941aef9f55425cc12204146f26d79c444cfae.

We cannot change this op code without breaking compatibility
with old code (client and server).  We'll have to special case
this op code instead.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit d3abd0fe0bb402ff403259d4b1a718a56331fc39)

Samuel Just [Fri, 4 Jan 2013 20:43:52 +0000 (12:43 -0800)]
ReplicatedPG: remove old-head optimization from push_to_replica

This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone.  In general, using head in clone_subsets is tricky since
we might be writing to head during the push.  calc_clone_subsets does not
consider head (probably for this reason).  Handling the clone from head case
properly would require blocking writes on head in the interim which is probably
a bad trade off anyway.

Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.

Fixes: #3698
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e89b6ade63cdad315ab754789de24008cfe42b37)

Sage Weil [Fri, 4 Jan 2013 01:15:07 +0000 (17:15 -0800)]
os/FileStore: fix non-btrfs op_seq commit order

The op_seq file is the starting point for journal replay.  For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap.  We normally ignore current/ contents anyway.

On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).

This fixes a serious bug that could cause data loss and corruption after
a power loss event.  For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.

Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 28d59d374b28629a230d36b93e60a8474c902aa5)

Samuel Just [Thu, 3 Jan 2013 17:59:45 +0000 (09:59 -0800)]
OSD: for old osds, dispatch peering messages immediately

Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message.  However, old osds assume
that the activation message (log or info) will be _dispatched
before the first sub_op_modify of the interval.  Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before the
activation message is sent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4ae4dce5c5bb547c1ff54d07c8b70d287490cae9)

Sage Weil [Thu, 3 Jan 2013 06:38:53 +0000 (22:38 -0800)]
osd: move common active vs booting code into consume_map

Push osdmaps to PGs in a separate method from activate_map() (whose name
is becoming less and less accurate).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a32d6c5dca081dcd8266f4ab51581ed6b2755685)

12 years ago  osd: let pgs process map advances before booting
Sage Weil [Thu, 3 Jan 2013 06:20:06 +0000 (22:20 -0800)]
osd: let pgs process map advances before booting

The OSD deliberately consumes and processes most of the OSDMaps from
the period while it was down before it marks itself up, as this can
be slow.  The new threading code does this asynchronously in
peering_wq, though, and does not let it drain before booting the OSD.
The OSD can get into a situation where it marks itself up but is not
responsive or useful because of the backlog, and only makes the
situation worse by generating more osdmaps as a result.

Fix this by calling activate_map() even when booting, and when booting
draining the peering_wq on each call.  This is harmless since we are
not yet processing actual ops; we only need to be async when active.

Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0bfad8ef2040a0dd4a0dc1d3abf3ab5b2019d179)

12 years ago  log: broadcast cond signals
Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]
log: broadcast cond signals

We were using a single cond, and only signalling one waiter.  That means
that if the flusher and several logging threads are waiting and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.

Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.

Instead, break the single cond into two: one for loggers, and one for the
flusher.  Always signal the (one) flusher, and always broadcast to all
loggers.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 813787af3dbb99e42f481af670c4bb0e254e4432)

12 years ago  log: fix locking typo/stupid for dump_recent()
Sage Weil [Wed, 2 Jan 2013 21:58:44 +0000 (13:58 -0800)]
log: fix locking typo/stupid for dump_recent()

We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 43cba617aa0247d714632bddf31b9271ef3a1b50)

12 years ago  test_filejournal: optionally specify journal filename as an argument
Sage Weil [Sat, 29 Dec 2012 00:48:22 +0000 (16:48 -0800)]
test_filejournal: optionally specify journal filename as an argument

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 483c6f76adf960017614a8641c4dcdbd7902ce33)

12 years ago  test_filejournal: test journaling bl with >IOV_MAX segments
Sage Weil [Sat, 29 Dec 2012 00:48:05 +0000 (16:48 -0800)]
test_filejournal: test journaling bl with >IOV_MAX segments

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c461e7fc1e34fdddd8ff8833693d067451df906b)

12 years ago  os/FileJournal: limit size of aio submission
Sage Weil [Sat, 29 Dec 2012 00:47:28 +0000 (16:47 -0800)]
os/FileJournal: limit size of aio submission

Limit size of each aio submission to IOV_MAX-1 (to be safe).  Take care to
only mark the last aio with the seq to signal completion.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit dda7b651895ab392db08e98bf621768fd77540f0)

12 years ago  os/FileJournal: logger is optional
Sage Weil [Fri, 28 Dec 2012 23:44:51 +0000 (15:44 -0800)]
os/FileJournal: logger is optional

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 076b418c7f03c5c62f811fdc566e4e2b776389b7)

12 years ago  v0.56
Gary Lowell [Tue, 1 Jan 2013 01:10:11 +0000 (17:10 -0800)]
v0.56

12 years ago  Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next
Sage Weil [Sun, 30 Dec 2012 23:29:37 +0000 (15:29 -0800)]
Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years ago  doc: fix rbd permissions for unprotect
Josh Durgin [Sun, 30 Dec 2012 07:57:01 +0000 (23:57 -0800)]
doc: fix rbd permissions for unprotect

Unprotect examines all pools, so use blanket x before 0.54. After
that, use class-read restricted by object_prefix to rbd_children.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  librbd: fix race between unprotect and clone
Josh Durgin [Sun, 30 Dec 2012 04:35:15 +0000 (20:35 -0800)]
librbd: fix race between unprotect and clone

Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  rbd: open (source) image as read-only
Josh Durgin [Sun, 30 Dec 2012 04:26:57 +0000 (20:26 -0800)]
rbd: open (source) image as read-only

This allows users without write access to copy, export and list
information about an image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  librbd: open parent as read-only during clone
Josh Durgin [Sat, 29 Dec 2012 06:13:37 +0000 (22:13 -0800)]
librbd: open parent as read-only during clone

We never write to the parent, and don't need to watch it during this process.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  librbd: add {rbd_}open_read_only()
Josh Durgin [Sat, 29 Dec 2012 00:54:51 +0000 (16:54 -0800)]
librbd: add {rbd_}open_read_only()

Since 58890cfad5f7bee933baa599a68e6c65993379d4, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.

For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  OSD: remove RD flag from CALL ops
Josh Durgin [Sat, 29 Dec 2012 03:47:09 +0000 (19:47 -0800)]
OSD: remove RD flag from CALL ops

20496b8d2b2c3779a771695c6f778abbdb66d92a forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  cls_rbd: get_children does not need write permission
Josh Durgin [Sat, 29 Dec 2012 03:44:36 +0000 (19:44 -0800)]
cls_rbd: get_children does not need write permission

This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago  init-ceph: ok, 8K files
Sage Weil [Sat, 29 Dec 2012 01:12:06 +0000 (17:12 -0800)]
init-ceph: ok, 8K files

16K might be a bit many.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  msg/Pipe: remove broken cephx signing requirement check
Sage Weil [Fri, 28 Dec 2012 00:01:49 +0000 (16:01 -0800)]
msg/Pipe: remove broken cephx signing requirement check

Remove the special-case check, which does not inform the peer what
protocol features are missing.  It also enforces this requirement even
when we negotiate auth none.

Reported as part of bug #3657.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  msg/Pipe: include remote socket addr in debug output
Sage Weil [Sat, 29 Dec 2012 00:00:47 +0000 (16:00 -0800)]
msg/Pipe: include remote socket addr in debug output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  osd: allow RecoveryDone self-transition in RepNotRecovering
Sage Weil [Fri, 28 Dec 2012 19:34:47 +0000 (11:34 -0800)]
osd: allow RecoveryDone self-transition in RepNotRecovering

In a mixed cluster where some OSDs support the recovery reservations and
some don't, the replica may be running new code in RepNotRecovering and
will complete a backfill.  In that case, we want to just stay in
RepNotRecovering.

It may also be possible to make it infer what the primary is doing even
though it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.

Fixes: #3689
Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  osd: less noise about inefficient tmap updates
Sage Weil [Fri, 28 Dec 2012 20:34:15 +0000 (12:34 -0800)]
osd: less noise about inefficient tmap updates

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  init-ceph: default to 16K max_open_files
Sage Weil [Fri, 28 Dec 2012 20:11:55 +0000 (12:11 -0800)]
init-ceph: default to 16K max_open_files

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  rgw: disable ops and usage logging by default
Sage Weil [Thu, 27 Dec 2012 21:27:46 +0000 (13:27 -0800)]
rgw: disable ops and usage logging by default

Most users don't need this, and having it on will just fill their clusters
with objects that will need to be cleaned up later.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years ago  java: remove deprecated libcephfs
Noah Watkins [Thu, 27 Dec 2012 20:06:02 +0000 (12:06 -0800)]
java: remove deprecated libcephfs

Removes ceph_set_default_*

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years ago  init-ceph: fix status version check across machines
Sage Weil [Fri, 28 Dec 2012 00:06:24 +0000 (16:06 -0800)]
init-ceph: fix status version check across machines

The local state isn't propagated into the backtick shell, resulting in
'unknown' for all remote daemons.  Avoid backticks altogether.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  osd: fix recovery assert for pg repair case
Sage Weil [Wed, 26 Dec 2012 23:27:07 +0000 (15:27 -0800)]
osd: fix recovery assert for pg repair case

In the case of PG repair, this assert is not valid.  Disable it for now.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  dropping xfs test 186 due to bug: 3685
tamil [Thu, 27 Dec 2012 19:27:31 +0000 (11:27 -0800)]
dropping xfs test 186 due to bug: 3685

Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
12 years ago  osd: drop 'osd recovery max active' back to previous default (5)
Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]
osd: drop 'osd recovery max active' back to previous default (5)

Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high.  In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).

Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago  journal: reduce journal max queue size
Sage Weil [Thu, 27 Dec 2012 19:11:08 +0000 (11:11 -0800)]
journal: reduce journal max queue size

Keep the journal queue size smaller than the filestore queue size.

Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.

Signed-off-by: Sage Weil <sage@inktank.com>