git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/log

Danny Al-Gaaf [Mon, 25 Feb 2013 14:36:37 +0000 (15:36 +0100)]

debian: add new files

Add new (installed) files to debian install files.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>

commit | commitdiff | tree

Danny Al-Gaaf [Mon, 25 Feb 2013 14:34:17 +0000 (15:34 +0100)]

ceph.spec.in: add new files

Add new files to spec file since they get installed.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>

commit | commitdiff | tree

Sage Weil [Sun, 24 Feb 2013 21:31:06 +0000 (13:31 -0800)]

debian: make gdisk, parted requirements, not recommendations.

ceph-prepare-disk (and thus ceph-deploy) need this.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Sun, 24 Feb 2013 21:22:47 +0000 (13:22 -0800)]

Merge remote-tracking branch 'gh/next'

commit | commitdiff | tree

Neil Levine [Sat, 23 Feb 2013 00:43:44 +0000 (00:43 +0000)]

Minor wording change.

Signed-off-by: Neil Levine <neil.levine@inktank.com>

commit | commitdiff | tree

Neil Levine [Sat, 23 Feb 2013 00:24:48 +0000 (00:24 +0000)]

Grammar typo

Signed-off-by: Neil Levine <neil.levine@inktank.com>

commit | commitdiff | tree

Neil Levine [Fri, 22 Feb 2013 22:41:09 +0000 (22:41 +0000)]

Changes to the OS support, multi-data center, and hypervisor questions.

Signed-off-by: Neil Levine <neil.levine@inktank.com>

commit | commitdiff | tree

Sage Weil [Sun, 24 Feb 2013 00:36:36 +0000 (16:36 -0800)]

mds: reencode MDSMap in MMDSMap if MDSENC feature is not present

In some cases the MMDSMap message from mon -> client passes from leader ->
peon -> client, and the leader doesn't encode with the correct feature
bits. As with MMOSDMap, we reencode the nested MDSMap based on the
features if relevant bits are not present.

We forgot to include this with the mds encoding changes.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Sat, 23 Feb 2013 16:38:10 +0000 (08:38 -0800)]

qa/run_xfstests.sh: use $TESTDIR instead of /tmp/cephtest

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 19:15:58 +0000 (11:15 -0800)]

osd: an interval can't go readwrite if its acting is empty

Let's not forget that min_size can be zero.

Fixes: #4159
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4277265d99647c9fe950ba627e5d86234cfd70a9)
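
A minimal sketch of the check described above, written in Python rather than the OSD's C++. The function and argument names are illustrative, not Ceph identifiers, and the real check involves further conditions that are omitted here.

```python
def interval_may_go_rw(acting, min_size):
    """Return True if an interval with this acting set could have gone readwrite.

    `acting` is the list of OSD ids in the acting set and `min_size` is the
    pool's minimum replica count. Both names are hypothetical.
    """
    # Comparing sizes alone is not enough: with min_size == 0 an empty
    # acting set would satisfy len(acting) >= min_size.
    return len(acting) > 0 and len(acting) >= min_size
```

With min_size set to zero, an empty acting set is still rejected, which is the case the message warns about.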

commit | commitdiff | tree

Sage Weil [Sat, 23 Feb 2013 00:24:18 +0000 (16:24 -0800)]

mkcephfs: create mon data dir prior to ceph-mon --mkfs

ceph-mon now expects this directory to already exist.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

John Wilkins [Fri, 22 Feb 2013 23:38:20 +0000 (15:38 -0800)]

doc: Added a lot of info to OSD troubleshooting.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

commit | commitdiff | tree

John Wilkins [Fri, 22 Feb 2013 23:37:03 +0000 (15:37 -0800)]

doc: Added mention of Admin Socket interface and brief description.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

commit | commitdiff | tree

John Wilkins [Fri, 22 Feb 2013 23:35:24 +0000 (15:35 -0800)]

doc: Changed title to OSD and PG, indicating both subjects are covered.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

commit | commitdiff | tree

John Wilkins [Fri, 22 Feb 2013 23:34:38 +0000 (15:34 -0800)]

doc: Added references from monitoring OSD to troubleshooting OSD.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

commit | commitdiff | tree

John Wilkins [Fri, 22 Feb 2013 23:33:43 +0000 (15:33 -0800)]

doc: set maxdepth to 2, so TOC isn't so long with new OSD troubleshooting.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 23:15:27 +0000 (15:15 -0800)]

client: use 4MB f_bsize and f_frsize for statfs

Old stat(1) reports:

  Block size: 1048576    Fundamental block size: 1048576

and the df(1) arithmetic works out.  New stat(1) reports:

  Block size: 1048576    Fundamental block size: 4096

which is what we are shoving into statvfs, but we have the f_bsize and
f_frsize arithmetic swapped.  However, doing the *correct* reporting would
then break the old stat by making both sizes appear to be 4KB (or
whatever).

Sidestep the issue by making *both* values 4MB, which is both large enough
to report large FS sizes, and also the default stripe size and thus a
"reasonable" value to report for a block size.

Perhaps in the future, when we no longer care about old userland, we can
report the page size for f_bsize, which is probably the "most correct"
thing to do.

Fixes: #3794. See also #3793.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
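
The arithmetic behind the sidestep can be sketched in Python. This is an illustration only: the helper names and the 10 GiB file system are assumptions, and the real change lives in the client's C++ statfs path.

```python
MiB = 1024 * 1024

def total_bytes(f_blocks, unit):
    """Size arithmetic a tool performs: block count times a unit size."""
    return f_blocks * unit

def statfs_fields(fs_bytes, block_size=4 * MiB):
    """Build (f_bsize, f_frsize, f_blocks) with both sizes set to block_size."""
    return block_size, block_size, fs_bytes // block_size

fs_bytes = 10 * 1024 * MiB  # a hypothetical 10 GiB file system
f_bsize, f_frsize, f_blocks = statfs_fields(fs_bytes)

# Because both sizes are equal, a tool that multiplies the block count by
# f_bsize and a tool that multiplies it by f_frsize get the same total.
size_via_bsize = total_bytes(f_blocks, f_bsize)
size_via_frsize = total_bytes(f_blocks, f_frsize)
```

Making the two fields equal means it no longer matters which of them a given userland multiplies by.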

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 22:57:45 +0000 (14:57 -0800)]

test/librados/watch_notify: fix warning

In file included from test/librados/watch_notify.cc:8:0:
../src/gtest/include/gtest/gtest.h: In function ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int, T2 = int]’:
../src/gtest/include/gtest/gtest.h:1300:30: instantiated from ‘static testing::AssertionResult testing::internal::EqHelper::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int, T2 = int, bool lhs_is_null_literal = false]’
test/librados/watch_notify.cc:67:224: instantiated from here
warning: ../src/gtest/include/gtest/gtest.h:1263:3: comparison between signed and unsigned integer expressions [-Wsign-compare]

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 22:39:34 +0000 (14:39 -0800)]

ceph-object-corpus: re-update

This was set by 9af94eea209fc2555f66214f01f3edddc35d4209, then single
paxos merge, then accidentally reverted by the next commit
6cb53740f2c356768adfbd3cb55c007d187309d3.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Samuel Just [Fri, 22 Feb 2013 22:12:28 +0000 (14:12 -0800)]

PG::proc_replica_log: oinfo.last_complete must be *before* first entry in omissing

Fixes: #4189
Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 22:23:45 +0000 (14:23 -0800)]

Merge remote-tracking branch 'gh/wip-rbd-flatten-deadlock'

Reviewed-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 22:16:07 +0000 (14:16 -0800)]

Merge remote-tracking branch 'gh/wip-objecter-fsx'

Reviewed-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

David Zafman [Fri, 22 Feb 2013 20:49:47 +0000 (12:49 -0800)]

Merge branch 'wip-3403-4-rebase'

Feature: #3403

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Josh Durgin [Fri, 22 Feb 2013 07:31:21 +0000 (23:31 -0800)]

objecter: don't resend linger ops unnecessarily

recalc_linger_op_target() was checking and then setting
linger_op->pgid and linger_op->active, but these were only set by
recalc_linger_op_target(). This was only called by handle_osd_map(),
so the first osdmap after a watch was established would cause a resend
of the watch. Analogous to the normal Op, set this information by
calling recalc_linger_op_target in send_linger().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

commit | commitdiff | tree

Josh Durgin [Fri, 22 Feb 2013 07:22:59 +0000 (23:22 -0800)]

objecter: initialize linger op snapid

Since they are write ops now, it must be CEPH_NOSNAP or the OSD
returns EINVAL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

commit | commitdiff | tree

David Zafman [Fri, 22 Feb 2013 05:45:06 +0000 (21:45 -0800)]

Add test for list_watchers() C++ interface

Signed-off-by: David Zafman <david.zafman@inktank.com>

commit | commitdiff | tree

David Zafman [Fri, 22 Feb 2013 01:59:17 +0000 (17:59 -0800)]

Add listwatchers command to rados

Signed-off-by: David Zafman <david.zafman@inktank.com>

commit | commitdiff | tree

David Zafman [Thu, 21 Feb 2013 23:20:08 +0000 (15:20 -0800)]

Add ObjectReadOperation and IoCtx functions

Signed-off-by: David Zafman <david.zafman@inktank.com>

commit | commitdiff | tree

David Zafman [Fri, 22 Feb 2013 00:11:01 +0000 (16:11 -0800)]

librados: expose a list of watchers on an object

Add new op CEPH_OSD_OP_LIST_WATCHERS
Add Objecter handling

Signed-off-by: David Zafman <david.zafman@inktank.com>

commit | commitdiff | tree

David Zafman [Fri, 22 Feb 2013 00:04:24 +0000 (16:04 -0800)]

Add rados_types.h header file

Signed-off-by: David Zafman <david.zafman@inktank.com>

commit | commitdiff | tree

Dan Mick [Fri, 22 Feb 2013 05:41:25 +0000 (21:41 -0800)]

configuration parsing: give better error for missing =

A ceph.conf line with "key" and no "= value" currently shows
"unexpected character while parsing putative key value,
at char N line M". There's no reason it can't be clearer.

Fixes: #4229
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
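
A rough Python sketch of the idea, not the actual ceph.conf parser: detect a key with no "=" and say so directly. The function name, the simplified comment handling, and the error wording are all assumptions made for illustration.

```python
def parse_conf_line(line, lineno):
    """Parse one 'key = value' line; return (key, value), or None to skip it."""
    text = line.split("#", 1)[0].strip()   # drop '#' comments (simplified)
    if not text or text.startswith("["):   # blank line or section header
        return None
    if "=" not in text:
        # Name the actual problem instead of reporting an unexpected character.
        raise ValueError(f"line {lineno}: expected '=' after key '{text}'")
    key, value = text.split("=", 1)
    return key.strip(), value.strip()
```

For example, parse_conf_line("public network", 7) raises a ValueError that names both the line and the key.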

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 01:55:21 +0000 (17:55 -0800)]

osd/PG: fix typo, missing -> omissing

From ce7ffc34408bf32c66dc07e6f42d54b7ec489d41.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Josh Durgin [Fri, 22 Feb 2013 01:39:19 +0000 (17:39 -0800)]

test_librbd_fsx: fix image closing

Always close the image we opened in check_clone(), and check the
return code of the rbd_close() called before cloning.

Refs: #3958
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 23:44:19 +0000 (15:44 -0800)]

objecter: separate out linger_read() and linger_mutate()

A watch is a mutation, while a notify is a read. The mutations need to
pass in a proper snap context to be fully correct.

Also, make the WRITE flag implicit so the caller doesn't need to pass it
in.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 23:31:08 +0000 (15:31 -0800)]

osd: make watch OSDOp print sanely

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 01:30:46 +0000 (17:30 -0800)]

Merge branch 'next'

commit | commitdiff | tree

Sage Weil [Fri, 22 Feb 2013 01:29:58 +0000 (17:29 -0800)]

ceph_common.sh: fix iteration of items in ceph.conf

This broke in c8f528a4070dd3aa0b25c435c6234032aee39b21.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Dan Mick [Fri, 22 Feb 2013 01:02:17 +0000 (17:02 -0800)]

ceph-conf.rst: missing '=' in example network settings

Signed-off-by: Dan Mick <dan.mick@inktank.com>

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 23:45:36 +0000 (15:45 -0800)]

Merge remote-tracking branch 'gh/wsp.bobtail.2merge'

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 23:31:36 +0000 (15:31 -0800)]

PG::proc_replica_log: adjust oinfo.last_complete based on omissing

Otherwise, search_for_missing may neglect to check the missing
set for some objects assuming that if the need version is
prior to last_complete, the replica must have it.

Fixes: #4994
Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 22:42:33 +0000 (14:42 -0800)]

Merge remote-tracking branch 'upstream/wip_clone_attrs'

Reviewed-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Greg Farnum [Thu, 21 Feb 2013 22:30:42 +0000 (14:30 -0800)]

MDS: remove a few other unnecessary is_base() checks

We should let users remove xattrs as well as set them. ;) And
the check in handle_client_setlayout was totally useless -- perhaps
intended for setdirlayout?

This is a follow-on to 9f82ae60fac30391dfa9d17d2fc014bf9e21f387 and
should be taken wherever it goes.

Signed-off-by: Greg Farnum <greg@inktank.com>

commit | commitdiff | tree

Greg Farnum [Thu, 21 Feb 2013 22:21:08 +0000 (14:21 -0800)]

mds: allow xattrs on the root inode

This was previously disallowed because Once Upon a Time, the root
inode wasn't persisted to disk and was an entirely in-memory construct. But
it's safe now, and has been for a while.

Signed-off-by: Greg Farnum <greg@inktank.com>

commit | commitdiff | tree

Greg Farnum [Thu, 21 Feb 2013 17:22:00 +0000 (09:22 -0800)]

mds: use inode_t::layout for dir layout policy

This cherry-pick is going in the reverse direction of normal. That's
because this direction makes for the minimal change -- this patchset
is required to fix the loss of directory layouts we were previously
seeing, but fixing it requires changing the encoding versions. So we
wrote it on top of Bobtail and let it update the struct_v's as they existed
then. Note that we here change a few encoding versions in ways which are
NOT COMPATIBLE with previous development code (but not any releases). In
particular, development code introduced and this removes the
file_layout_policy_t, and some of the CInode and EMetaBlob encoding
struct_v values were used in development code to mean one thing, but
mean something different due to the Bobtail patch.

Remove the default_file_layout struct, which was just a ceph_file_layout,
and store it in the inode_t. Rip out all the annoying code that put this
on the heap.

To aid in this usage, add a clear_layout() function to inode_t.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 36ed407e0f939a9bca57c3ffc0ee5608d50ab7ed)
Conflicts:

src/mds/CInode.cc
src/mds/CInode.h
src/mds/MDCache.cc
src/mds/Server.cc
src/mds/events/EMetaBlob.h
Cherry-pick-Reviewed-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Mon, 21 Jan 2013 05:53:37 +0000 (21:53 -0800)]

mds: parse ceph.*.layout vxattr key/value content

Use qi to parse a strictly formatted set of key/value pairs. Be picky
about whitespace. Any subset of recognized keys is allowed. Parse the
same set of keys as the ceph.*.layout.* vxattrs.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5551aa5b3b5c2e9e7006476b9cd8cc181d2c9a04)
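
A strict key/value parse of this kind can be sketched in Python with a regular expression (the commit itself uses qi, in C++). The key set and the single-space separator are assumptions; the message only says the keys match the ceph.*.layout.* vxattrs.

```python
import re

# Assumed key set, for illustration only.
RECOGNIZED_KEYS = {"stripe_unit", "stripe_count", "object_size", "pool"}
_PAIR = re.compile(r"([a-z_]+)=(\S+)")

def parse_layout(value):
    """Parse 'k=v k=v ...' with exactly one space between pairs."""
    result = {}
    for token in value.split(" "):
        m = _PAIR.fullmatch(token)
        # Extra spaces produce empty tokens, which fail the match, so the
        # parser stays picky about whitespace.
        if m is None or m.group(1) not in RECOGNIZED_KEYS:
            raise ValueError(f"bad layout token: {token!r}")
        result[m.group(1)] = m.group(2)
    return result
```

Any subset of the recognized keys parses; unknown keys, doubled spaces, and spaces around "=" are rejected.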

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 21:28:47 +0000 (13:28 -0800)]

osdc/Objecter: unwatch is a mutation, not a read

This was causing librados to unblock after the ACK on unwatch, which meant
that librbd users raced and tried to delete the image before the unwatch
change was committed, and got EBUSY. See #3958.

The watch operation has a similar problem.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 21:28:26 +0000 (13:28 -0800)]

FileStore::_clone: use _fsetattrs rather than _setattrs

The omap portion of the clone happened above in DBObjectMap::clone.
Only the fs stored attrs need to be explicitly copied.

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 21:26:56 +0000 (13:26 -0800)]

FileStore::_setattrs: use _fsetattrs

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 21:26:40 +0000 (13:26 -0800)]

FileStore: add _fsetattrs

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 21:25:49 +0000 (13:25 -0800)]

FileStore::_setattrs: only do omap operations if necessary

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 21 Feb 2013 21:24:42 +0000 (13:24 -0800)]

FileStore::_setattrs: no need to grab an Index lock for the omap operations

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Yehuda Sadeh [Thu, 21 Feb 2013 20:59:06 +0000 (12:59 -0800)]

Merge pull request #67 from jaharkes/content_length

Handle empty CONTENT_LENGTH environment variable.

commit | commitdiff | tree

Jan Harkes [Thu, 21 Feb 2013 20:17:38 +0000 (15:17 -0500)]

Fix failing > 4MB range requests through radosgw S3 API.

When a range request is made for more than rgw_get_obj_max_req_size
bytes the first returned chunk sets 'ret' to STATUS_PARTIAL_CONTENT and
all remaining chunks behave as if there is an error state and only
return a minimal header.

Fix this by passing STATUS_PARTIAL_CONTENT to set_req_state_err, but
leave the 'ret' member variable untouched.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit c83a01d4e8dcd26eec24c020c5b79fcfa4ae44a3)

commit | commitdiff | tree

Yehuda Sadeh [Thu, 21 Feb 2013 20:42:06 +0000 (12:42 -0800)]

Merge pull request #66 from jaharkes/range_requests

Fix failing > 4MB range requests through radosgw S3 API.

commit | commitdiff | tree

Jan Harkes [Mon, 18 Feb 2013 21:15:36 +0000 (16:15 -0500)]

Handle empty CONTENT_LENGTH environment variable.

nginx seems to be providing a CONTENT_LENGTH environment variable with no data
when the request body is empty.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
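
A small Python sketch of the handling, assuming a CGI-style environment mapping. The function name is hypothetical, and treating an empty value as length 0 is this sketch's choice, not necessarily what radosgw does.

```python
def content_length(environ):
    """Return the request body length from a CGI-style environment mapping."""
    raw = environ.get("CONTENT_LENGTH")
    # Treat a variable that is present but empty like a missing one.
    if raw is None or raw.strip() == "":
        return 0
    return int(raw)
```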

commit | commitdiff | tree

Jan Harkes [Thu, 21 Feb 2013 20:17:38 +0000 (15:17 -0500)]

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 19:15:58 +0000 (11:15 -0800)]

osd: an interval can't go readwrite if its acting is empty

Let's not forget that min_size can be zero.

Fixes: #4159
Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Josh Durgin [Thu, 21 Feb 2013 19:26:45 +0000 (11:26 -0800)]

librbd: make sure racing flattens don't crash

The only way for a parent to disappear is a racing flatten completing,
or possibly in the future the image being forcibly removed. In either
case, continuing to flatten makes no sense, so stop early.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

commit | commitdiff | tree

Josh Durgin [Thu, 21 Feb 2013 19:17:18 +0000 (11:17 -0800)]

librbd: use rwlocks instead of mutexes for several fields

Image metadata like snapshots, size, and parent is frequently read,
but rarely updated. During flatten, we were depending on the parent
lock to prevent the parent ImageCtx from disappearing out from under
us while we read from it. The copy-up path also needed the parent lock
to be able to read from the parent image, which led to a deadlock.

Convert parent_lock, snap_lock, and md_lock to RWLocks, and change
their use to read instead of exclusive locks where appropriate. The
main place exclusive locks are needed is in ictx_refresh, so this is
pretty simple. This fixes the deadlock, since parent_lock is only
needed for read access in both flatten and the copy-up operation.

cache_lock and refresh_lock are only really used for exclusive access,
so leave them as regular mutexes.

One downside to this is that there's no way to assert is_locked()
for RWLocks, so we'll have to be very careful about changing code
in the future.

Fixes: #3665
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
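
To illustrate why shared read access removes the deadlock, here is a minimal readers-writer lock in Python with scoped helpers. It is a sketch, not Ceph's RWLock: the class and method names are invented, and it has no writer preference, so a steady stream of readers could starve a writer.

```python
import threading
from contextlib import contextmanager

class RWLock:
    """Minimal readers-writer lock: many readers or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read_locked(self):
        with self._cond:
            while self._writer:            # readers only wait for a writer
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write_locked(self):
        with self._cond:
            while self._writer or self._readers:   # writer needs exclusivity
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()

parent_lock = RWLock()
with parent_lock.read_locked():        # e.g. the flatten path
    with parent_lock.read_locked():    # e.g. the copy-up path; does not block
        readers = parent_lock._readers
```

With a plain mutex the second acquisition would block forever; with read locks both holders proceed.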

commit | commitdiff | tree

Josh Durgin [Thu, 21 Feb 2013 19:15:41 +0000 (11:15 -0800)]

common: add lockers for RWLocks

This makes them easier to use, especially instead of existing mutexes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 18:44:04 +0000 (10:44 -0800)]

Merge branch 'next'

Conflicts:
src/osd/ReplicatedPG.cc

commit | commitdiff | tree

Sage Weil [Thu, 21 Feb 2013 18:30:08 +0000 (10:30 -0800)]

osd: clear recovery state on pg removal

This ensures we release our in-progress recovery counters, which prevents
recovery from getting blocked indefinitely when a pool removal races with
recovery ops.

Fixes: #4217
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Josh Durgin [Thu, 21 Feb 2013 01:04:58 +0000 (17:04 -0800)]

test: fix run-rbd-tests pool deletion

Use the new safety check

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Thu, 21 Feb 2013 18:29:36 +0000 (18:29 +0000)]

ceph-object-corpus: use temporary 'wsp.master.new' corpus until we get merged into master

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Thu, 21 Feb 2013 18:04:22 +0000 (18:04 +0000)]

Merge branch 'wsp.bobtail.2merge' into wsp.bobtail.master

Conflicts:
src/.gitignore
src/Makefile.am
src/include/ceph_features.h
src/mon/MDSMonitor.cc
src/mon/PGMonitor.cc

commit | commitdiff | tree

Joao Eduardo Luis [Wed, 30 Jan 2013 16:04:36 +0000 (16:04 +0000)]

vstart.sh: Create mon data directory before --mkfs

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Wed, 30 Jan 2013 17:54:11 +0000 (17:54 +0000)]

test: ObjectMap: add a generic leveldb store tool

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Wed, 20 Feb 2013 18:46:34 +0000 (18:46 +0000)]

mon: ceph-mon: convert an old monitor store to the new format

With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.

With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.

This patch now introduces a mechanism to convert a GV-enabled store to
the single-paxos format store when the monitor is upgraded.

As we require the global versions to be present, we first check if the
store has the GV feature set: if not we will not proceed, but we will
start the conversion otherwise.

At the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lies,
alongside the old store. This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.

The conversion happens during a rolling upgrade, without any
intervention by the user. Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that might not be required, and proceed to start as expected.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Tue, 18 Sep 2012 15:10:39 +0000 (16:10 +0100)]

mon: Add an offline monitor store converter

This tool will convert an old monitor store format (bobtail) to the new
key/value store-backed, single-paxos format.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Mon, 17 Sep 2012 17:08:05 +0000 (18:08 +0100)]

os: LevelDBStore: scrap init() and create open() and create_and_open()

The init() function always implicitly created a new store if it was
missing.

This patch makes init() a private function accepting a bool that
specifies whether or not we want to create the store if it does not
exist, and creates two functions: open() and create_and_open().

open() will fail if the store we are trying to open does not exist;
create_and_open() keeps the previous behavior of init() and will create
the store if it does not exist before opening it.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
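
The split between open() and create_and_open() can be sketched with a toy directory-backed store in Python. DirStore and its methods are stand-ins for illustration, not the LevelDBStore interface itself.

```python
import os
import tempfile

class DirStore:
    """Toy store backed by a directory; stands in for a LevelDB-style store."""

    def __init__(self, path):
        self.path = path

    def _init(self, create_if_missing):
        # Private helper: the bool decides whether a missing store is an error.
        if not os.path.isdir(self.path):
            if not create_if_missing:
                raise FileNotFoundError(self.path)
            os.makedirs(self.path)
        return self

    def open(self):
        return self._init(create_if_missing=False)

    def create_and_open(self):
        return self._init(create_if_missing=True)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.db")
    try:
        DirStore(path).open()            # store is missing: must fail
        opened_missing = True
    except FileNotFoundError:
        opened_missing = False
    DirStore(path).create_and_open()     # creates the directory
    reopened = DirStore(path).open() is not None
```

Callers that must never create a store by accident use open(); only the setup path uses create_and_open().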

commit | commitdiff | tree

Joao Eduardo Luis [Tue, 12 Feb 2013 13:25:16 +0000 (13:25 +0000)]

mon: Monitor: Add monitor store synchronization support

Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.

This process roughly consists of the following steps:

  0. mon.X tries to join the cluster;
  1. mon.X verifies that it has diverged from the remaining cluster;
  2. mon.X asks the leader to sync;
  3. the leader allows mon.X to sync, pointing out a mon.Y from
     which mon.X should sync;
  4. mon.X asks mon.Y to sync;
  5. mon.Y sends its own store in one or more chunks;
  6. mon.X acks each received chunk; go to 5;
  7. mon.X receives the last chunk from mon.Y;
  8. mon.X informs the leader that it has finished synchronizing;
  9. the leader acks mon.X's finished sync;
10. mon.X bootstraps and retries joining the cluster (goto 0.)

This is the simplest and most straightforward process that can be hoped
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are mechanisms at work to avoid such a scenario at any step
of the process.
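
Steps 5 through 7 of the list above (chunked transfer with per-chunk acks) can be sketched in Python as follows. This is a toy, single-process model under stated assumptions: the function names are invented, chunks are counted in keys rather than bytes, and the leader, timeouts, and failure handling are left out.

```python
def provider_chunks(store, chunk_size):
    """Yield (chunk, is_last) pairs covering the provider's key/value store."""
    items = sorted(store.items())
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        yield chunk, i + chunk_size >= len(items)

def sync_from(provider_store, chunk_size=2):
    """Requester side: apply each chunk, ack it, stop after the last one."""
    local = {}
    acks = 0
    for chunk, is_last in provider_chunks(provider_store, chunk_size):
        local.update(chunk)   # step 5: receive a chunk from mon.Y
        acks += 1             # step 6: ack it, which requests the next chunk
        if is_last:           # step 7: the last chunk has arrived
            break
    return local, acks
```

After the loop the requester holds a copy of the provider's store and would go on to report completion to the leader (step 8).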

Some of these mechanisms include:

- aborting the sync if the leader fails or leadership changes;
- state barriers on synchronization functions to avoid stray/outdated
   messages from interfering with the normal monitor behavior or ongoing
   synchronization;
- store clean-up before any synchronization process starts;
- store clean-up if a sync process fails;
- resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
- several timeouts to guarantee that all the involved parties are still
   alive and participating in the sync effort;
- request forwarding when mon.X contacts a monitor outside the quorum
   that might know who the leader is (or might know someone who does)
   [4].

Changes:
  - Adapt the MMonProbe message for the single-paxos approach, dropping
    the version map and using a lower and upper bound version instead.
  - Remove old slurp code.
  - Add 'sync force' command; 'sync_force' through the admin socket.

Notes:

[1] It's important to keep track of the paxos version at the time at
    which a store sync starts.  Given that after the sync we end up with
    the same state as the monitor we are synchronizing from, there is a
    chance that we might end up with an uncommitted paxos version if we
    are synchronizing with the leader (there's some paxos stashing done
    prior to commit on the leader).  By keeping track of the version at
    which the sync started, we can then tell the requester at which
    version it should cap its paxos store.

[2] Furthermore, the enforced paxos cap, described on [1], is even more
    important if we consider the need to reapply the paxos versions that
    were received during the sync, to make sure the paxos store is
    consistent.  If we happened to have some yet-uncommitted version in
    the store, we could end up applying it.

[3] What is described in [1] and [2]:

    Fixes: #4026
    Fixes: #4037
    Fixes: #4040

[4] Whenever a given monitor mon.X is in the probing phase and notices
    that there is a mon.Y with a paxos version considerably higher than
    the one mon.X has, then mon.X will attempt to synchronize from
    mon.Y.  This is the basis for the store sync.  However true this
    might be, there is a chance that, by the time mon.Y handles the
    sync request from mon.X, mon.Y is already attempting a sync itself
    with some other mon.Z.  In this case, the appropriate thing for
    mon.Y to do is to forward mon.X's request to mon.Z, as mon.Z should
    be part of the quorum, know who the leader is, or be the leader
    itself -- if not, at least it is guaranteed that mon.Z has a higher
    version than both mon.X and mon.Y, so it should be okay to sync
    from it.

Fixes: #4162
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Mon, 9 Jul 2012 21:51:49 +0000 (22:51 +0100)]

message: MMonSync: Monitor Synchronization message

The monitor's synchronization process requires a specific message type
to carry the required information. Since this process significantly
differs from slurping, reusing the MMonProbe message is not an option as
it would require major changes and, for all intents and purposes, it
would be far outside the scope of the MMonProbe message.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Wed, 15 Aug 2012 14:35:39 +0000 (15:35 +0100)]

mon: MonitorDBStore: add store iterators to obtain chunks for sync

We created an interface specific to the MonitorDBStore, which can be used
to create iterators to obtain chunks for sync.

Two different iterators were defined: one that will iterate over the whole
store, focusing on the specified set of prefixes; another that will
iterate over only one specific prefix.

These two different iterators allow us to build the sync process in two
distinct phases: 1) obtain all key/value pairs for paxos and all paxos
services, bundle them in chunks and send them over the wire; and 2) obtain
all the paxos versions, bundle them in chunks and send them over the wire.

Also, we are currently considering a chunk to be (at most) 1 MB worth of
data, although it can be tuned using 'mon_sync_max_payload_size' option.
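
Bundling key/value pairs into size-bounded chunks can be sketched in Python like this. It is an illustration only: the function name is invented, sizes are counted as key plus value bytes, and letting an oversized pair occupy a chunk of its own is this sketch's choice.

```python
def chunk_pairs(pairs, max_payload=1024 * 1024):
    """Group (key, value) byte pairs into chunks of at most max_payload bytes.

    A single pair larger than max_payload still gets a chunk of its own.
    """
    chunk, size = [], 0
    for key, value in pairs:
        pair_size = len(key) + len(value)
        if chunk and size + pair_size > max_payload:
            yield chunk               # current chunk is full; start a new one
            chunk, size = [], 0
        chunk.append((key, value))
        size += pair_size
    if chunk:
        yield chunk                   # flush the final, possibly short, chunk
```

The default of 1 MB mirrors the figure in the message; in this sketch max_payload plays the role of 'mon_sync_max_payload_size'.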

mon: MonitorDBStore: add crc support when --mon-sync-debug is set

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Fri, 31 Aug 2012 17:39:27 +0000 (18:39 +0100)]

mon: Paxos: get rid of slurp-related code

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Fri, 28 Sep 2012 15:10:29 +0000 (16:10 +0100)]

mon: PaxosService: rework full version stashing

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Wed, 4 Jul 2012 10:47:03 +0000 (11:47 +0100)]

mon: Paxos: trim through Paxos

Instead of directly modifying the store whenever we want to trim our Paxos
state, we should do it through Paxos, proposing the trim to the quorum and
committing it once accepted.

This enforces three major invariants that we will be able to leverage later
on during the store synchronization:

1) The Leader will set the pace for trimming across the system. No one
    will trim their state unless they are committing the value proposed by
    the Leader;

2) Following (1), the monitors in the quorum will trim at the same time.
    There will be no diverging states due to trimming on different monitors.

3) Each trim will be kept as a transaction in the Paxos' store allowing
    us to obtain a consistent state during synchronization, by shipping
    the Paxos versions to the other monitor and applying them. We could
    end up in an inconsistent state if the trim happened without
    constraints, without being logged; by going through Paxos this concern
    is no longer relevant.

The trimming itself may be triggered each time a proposal finishes, which
is the time at which we know we have committed a new version on the store.

It shall be triggered iff we are sure we have enough versions on the store
to fill the gap of any monitor that might become alive and still hasn't
drifted enough to require synchronization. Roughly speaking, we will check
if the number of available versions is higher than 'paxos_max_join_drift'.

Furthermore, we added a new option, 'paxos_trim_tolerance', so we are able
to avoid trimming every single time the above condition is met -- which
would happen every time we trimmed a version, and then proposed a new one,
and then we would trim it again, etc. So, just tolerate a couple of commits
before trimming again.
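
One plausible reading of the trim condition, sketched in Python. The function names and the exact way the two options combine are assumptions; only 'paxos_max_join_drift' and 'paxos_trim_tolerance' come from the message.

```python
def should_trim(first_committed, last_committed, max_join_drift, trim_tolerance):
    """Decide whether to propose a trim after a proposal finishes."""
    available = last_committed - first_committed
    # Keep at least max_join_drift versions for monitors that rejoin, and
    # wait for trim_tolerance extra commits so we do not trim on every commit.
    return available > max_join_drift + trim_tolerance

def trim_target(last_committed, max_join_drift):
    """First version to keep so that max_join_drift versions remain."""
    return last_committed - max_join_drift
```

With a drift of 10 and a tolerance of 3, a store holding versions 1 through 14 is left alone, while one holding 1 through 15 trims up to version 5.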

Finally, we added support to enable/disable trimming, which will be
essential during the store synchronization process.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Mon, 11 Jun 2012 13:55:21 +0000 (14:55 +0100)]

mon: Single-paxos and key/value store support

We are converting the monitor subsystem to a Single-Paxos architecture,
backed by a key/value store. The previous architecture used a Paxos
instance for each Paxos Service, backed by a nasty Monitor Store that
provided few to no consistency guarantees whatsoever, which led to a fair
amount of workarounds.

Changes:

* Paxos:
  - Add k/v store support
  - Add documentation describing the new Paxos storage layout and behavior
  - Get rid of the stashing code, which was used as a consistency point
    mechanism (we no longer need it, because of our k/v store)
  - Debug level of 30 will output json-formatted transaction dumps
  - Allow proposal queueing; queued proposals are proposed in the same
    order as they were queued.
  - Remove the 'is_leader()' function; use the Monitor's instead, for
    simplicity.
  - Add 'is_lease_valid()' function.
  - Disregard 'stashed versions'
  - Make the paxos 'state' variable a bit-map, so we lock the proposal
    mechanism while maintaining the state [5].
  - Related notes: [3]

* PaxosService:
  - Add k/v store support, creating wrappers to be used by the services
  - Add documentation
  - Support single-paxos behavior, creating wrappers to be used by the
    services and service-specific version
  - Rearrange variables so they are neatly organized in the beginning of
    the class
  - Add a trim_to() function to be used by the services, instead of letting
    them rely on Paxos::trim_to(), which is no longer adequate to the job
    at hand
  - Debug level of 30 will output json-formatted transaction dumps
  - Support proposal queueing, taking it into consideration when
    assessing the current state of the service (active, writeable,
    readable, ...)
  - Redefine the conditions for 'is_{active,readable,writeable}()' given
    the new single-paxos approach, with proposal queueing [1].
  - Use our own waiting_for_* callback lists, which now must be
    dissociated from their Paxos counterparts [2].
  - Related notes: [3], [4]

* Monitor:
  - Add k/v store support
  - Use only one Paxos instance and pass it down to each service instance
  - Crank up CEPH_MON_PROTOCOL to 10

* {Auth,Log,MDS,Monmap,OSD,PG}Monitor:
  - Add k/v store support
  - Add single-paxos support

* AuthMonitor:
  - Don't always propose full versions: if the KeyServer doesn't have
    keys, we cannot propose a full version. This should only happen when
    we start with a brand new store and we are creating the first
    pending proposal, and if we were to commit a full version filled
    with nothing but a big void of nothingness, we could eventually end
    up with a corrupted version.

* Elector:
  - Add k/v store support
  - Add single-paxos support

* ceph-mon:
  - Use the monitor's k/v store instead of MonitorStore

* MMonPaxos:
  - remove the machine_id field: This field was used to identify from/to
    which paxos service a given message belonged. We no longer have a Paxos
    for each service, so this field became obsolete.

Notes:

[1] Redefine the conditions for 'is_{active,readable,writeable}()' on
    the PaxosService class, to be used with single-paxos and proposal
    queueing:

  We should not rely on the Paxos::is_*() functions, since they do not apply
  directly to the PaxosService.

  All the PaxosService classes share the same Paxos class, but they do not
  rely on its values. Each service relies on, uses and updates only its own
  values on the k/v store. Thus, we may have a given service (e.g., the
  OSDMonitor) proposing a new value, hence updating or waiting to update its
  store, and we may still consider the LogMonitor as being able to read and
  write its own values on the k/v store. In a nutshell, different services
  do not overlap on their access to their own store when it comes to reading,
  and since the Paxos will queue their updates and deal with them in a FIFO
  order, their updates won't overlap either.

  Therefore, the conditions for the PaxosService::is_{active,readable,
  writeable} differ from those on the Paxos::is_{active,readable,writeable}.

  * PaxosService::is_active() - the PaxosService will be considered as
  active iff it is not proposing and the Paxos is not recovering. This
  means that a given PaxosService (e.g., the OSDMonitor) may be considered
  as being active even though some other service (e.g., the LogMonitor) is
  proposing a new value and the Paxos is on the UPDATING state. This means
  that the OSDMonitor will be able to read its own versions and queue any
  changes on to the Paxos. However, if the Paxos is on state RECOVERING,
  we cannot be considered as active.

  * PaxosService::is_writeable() - We will be able to propose new values
  iff we are the Leader, we have a valid lease, and we are not already
  proposing. If we are proposing, we must wait for our proposal to finish
  before writing to our k/v store. Otherwise we could assume that our last
  committed version was, say, 10, assign map epochs/versions based on that,
  make changes to the store based on those values, and end up overwriting
  previously proposed values on the store. We really don't want that. To be
  fair, we could probably assume we are always writeable, but that may have
  unforeseen consequences; so we take the conservative approach for now, and
  will relax it in the future if we believe it to be worthwhile.

  * PaxosService::is_readable() - We will be readable iff we are not
  proposing and the Paxos is not recovering; if our last committed version
  exists; and if we are either a cluster of one or we have a valid lease.
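
The three conditions above can be written down as plain predicates. The
sketch below is an illustration under stated assumptions, not the actual
PaxosService code: ServiceView and its fields are hypothetical names that
simply capture the inputs the note mentions.

```cpp
#include <cassert>

// Hypothetical snapshot of the inputs the note describes.
struct ServiceView {
  bool proposing;           // this service has a proposal in flight
  bool paxos_recovering;    // the shared Paxos is in its RECOVERING state
  bool is_leader;           // this monitor is the Leader
  bool lease_valid;         // we hold a valid lease
  bool has_last_committed;  // our last committed version exists on the store
  bool single_monitor;      // we are a cluster of one
};

// Active iff we are not proposing and the Paxos is not recovering.
bool is_active(const ServiceView& s) {
  return !s.proposing && !s.paxos_recovering;
}

// Writeable iff we are the Leader, hold a valid lease, and are not proposing.
bool is_writeable(const ServiceView& s) {
  return s.is_leader && s.lease_valid && !s.proposing;
}

// Readable iff active, the last committed version exists, and we are
// either a cluster of one or hold a valid lease.
bool is_readable(const ServiceView& s) {
  return is_active(s) && s.has_last_committed &&
         (s.single_monitor || s.lease_valid);
}

// Usage: another service proposing does not show up in our own view, so we
// stay active; once we propose ourselves, we are no longer writeable.
void example() {
  ServiceView osd{false, false, true, true, true, false};
  assert(is_active(osd) && is_writeable(osd) && is_readable(osd));
  osd.proposing = true;
  assert(!is_active(osd) && !is_writeable(osd));
  osd.proposing = false;
  osd.paxos_recovering = true;
  assert(!is_active(osd) && !is_readable(osd));
}
```

Note that the Paxos being in the UPDATING state on behalf of another service
does not appear in this view at all, which is the point of the redefinition.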

[2] Use own waiting_for_* callback lists on PaxosService, which now must
    be dissociated from their Paxos counterparts:

  We were relying on Paxos to wait for state changes, but since our state
  became somewhat independent from the Paxos state, we have to deal with
  callbacks waiting for 'readable', 'writable' or 'active' on different
  terms than those that Paxos provides.

  So, basically, we will take one of two approaches when it comes to waiting:

  * If we are proposing, queue ourselves on our own list, waiting for the
  proposal to finish;
  * Otherwise, the cause for the need to wait comes from Paxos, so queue
  the callback directly on Paxos.

  This approach means that we must make sure to check our desired state
  whenever the callback is fired up, and re-queue ourselves if the state
  didn't quite change (or if it changed but our waiting condition result
  didn't). For instance, if we were waiting for a proposal to finish due to
  a failed 'is_active()', we will need to recheck if we are active before
  continuing once the callback is fired. This is mainly because we may have
  finished our proposal, but a new Election may have been called and the
  Paxos may not be active.
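
A minimal sketch of this wait-and-recheck pattern follows. PaxosStub,
ServiceStub, run_when_active() and fire() are hypothetical names made up
for the illustration; the real classes carry far more state.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

using Callback = std::function<void()>;

// Hypothetical stand-in for the shared Paxos instance.
struct PaxosStub {
  bool active = true;
  std::list<Callback> waiting_for_active;
};

// Hypothetical stand-in for a PaxosService.
struct ServiceStub {
  PaxosStub* paxos;
  bool proposing = false;
  std::list<Callback> waiting_for_finished_proposal;

  explicit ServiceStub(PaxosStub* p) : paxos(p) { assert(paxos != nullptr); }

  bool is_active() const { return !proposing && paxos->active; }

  // If we are proposing, wait on our own list; otherwise the cause of the
  // wait comes from Paxos, so queue the callback directly on Paxos.
  void wait_for_active(Callback c) {
    if (proposing)
      waiting_for_finished_proposal.push_back(std::move(c));
    else
      paxos->waiting_for_active.push_back(std::move(c));
  }

  // Run `work` once we are active. When a callback fires, the state is
  // checked again and we re-queue if the condition still does not hold.
  void run_when_active(Callback work) {
    if (is_active()) {
      work();
      return;
    }
    wait_for_active([this, work] { run_when_active(work); });
  }
};

// Fire every callback on a list. The list is emptied first, so callbacks
// that re-queue themselves land on a fresh list.
void fire(std::list<Callback>& waiters) {
  std::list<Callback> ready;
  ready.swap(waiters);
  for (auto& c : ready)
    c();
}
```

If a proposal finishes but an election has made the Paxos inactive in the
meantime, firing the service's own list moves the callback onto the Paxos
list instead of running the work prematurely.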

[3] Propose everything in the queue before bootstrapping, but don't
    allow new proposals:

  The MonmapMonitor may issue bootstraps once it is updated. We must ensure
  that we propose every single pending proposal before we actually do it.

  However, we don't want to propose if we are going to bootstrap; otherwise,
  we may end up losing proposals.

[4] Handle the case when first_committed_version equals 0 on a
    PaxosService

  In a nutshell, the services do not set the first committed version, as
  they consider it as a SEP (Somebody Else's Problem). They do rely on it
  though, and we, the PaxosService, must ensure that it contains a valid
  value (that is, higher than zero) at all times.

  Since we will only have a first_committed version equal to zero once,
  and that is before the service's first proposal, we are safe to simply
  read the variable from the store and assign the first_committed the same
  value as the last_committed iff the first_committed version is zero.

  This also affects trimming, since trimming relies on the first_committed
  version as the lower bound for version trimming. Even though the k/v store
  will gracefully ignore any problem from trying to remove non-existent
  versions, the main issue would still stand: we'd be removing a non-existent
  version and that just doesn't make any sense.
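
The fix-up described in this note can be sketched as below. The Store
alias, the key names and load_first_committed() are hypothetical and only
illustrate the rule; the real code reads from the monitor's k/v store.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the service's portion of the k/v store.
using Store = std::map<std::string, uint64_t>;

// A missing key reads as version zero.
uint64_t get_or_zero(const Store& st, const std::string& key) {
  auto it = st.find(key);
  return it == st.end() ? 0 : it->second;
}

// Return the first committed version to use. If the stored value is zero,
// assign it the same value as last_committed so that trimming never starts
// from a non-existent version.
uint64_t load_first_committed(const Store& st) {
  uint64_t first = get_or_zero(st, "first_committed");
  uint64_t last = get_or_zero(st, "last_committed");
  if (first == 0)
    first = last;
  assert(first <= last);
  return first;
}
```

Before the service's first proposal both values are zero, so the sketch
returns zero in that one case, matching the "only once" remark above.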

[5] 'lock' paxos when we are running some internal proposals

  Force the paxos services to wait for us to complete whatever we are
  doing before they can proceed.  This is required because on certain
  occasions we might need to run internal proposals that are not tied to
  any of the paxos services (for instance, when learning an old value), and
  we need the services to stay put, or they might end up in an erroneous
  state and crash the monitor.

  This could have been done with an extra bool, but there was no point
  in creating a new variable when we can just as easily reuse the
  'state' variable for our twisted interests.
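
As an illustration of reusing the 'state' variable as a bit-map, here is a
small sketch. The flag names and values and the PaxosState struct are
assumptions made for the example, not the definitions from this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values; one bit per state, plus one bit for the lock.
constexpr uint32_t STATE_RECOVERING = 1u << 0;
constexpr uint32_t STATE_ACTIVE     = 1u << 1;
constexpr uint32_t STATE_UPDATING   = 1u << 2;
constexpr uint32_t STATE_LOCKED     = 1u << 3;  // internal proposal running

struct PaxosState {
  uint32_t state = STATE_RECOVERING;

  // Setting or clearing the lock bit leaves the other state bits untouched.
  void lock() { state |= STATE_LOCKED; }
  void unlock() {
    assert(is_locked());
    state &= ~STATE_LOCKED;
  }

  bool is_locked() const { return (state & STATE_LOCKED) != 0; }
  bool is_active() const { return (state & STATE_ACTIVE) != 0; }

  // Services must stay put while an internal proposal holds the lock.
  bool services_may_propose() const { return is_active() && !is_locked(); }
};
```

Because the lock is just one more bit, the underlying state is still
readable while locked, which is what "maintaining the state" refers to.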

Fixes: #4175
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Sage Weil [Mon, 21 Jan 2013 05:53:37 +0000 (21:53 -0800)]

commit | commitdiff | tree

Samuel Just [Wed, 20 Feb 2013 21:29:31 +0000 (13:29 -0800)]

Merge branch 'wip_watch_cleanup'

Reviewed-by: Greg Farnum <greg@inktank.com>

commit | commitdiff | tree

Samuel Just [Wed, 20 Feb 2013 00:19:20 +0000 (16:19 -0800)]

ReplicatedPG: allow multiple watches in one transaction

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Thu, 14 Feb 2013 01:14:12 +0000 (17:14 -0800)]

doc: add some internal docs for watch/notify

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Tue, 12 Feb 2013 22:04:55 +0000 (14:04 -0800)]

librados/: include watch cookie in notify_ack

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Tue, 12 Feb 2013 21:43:36 +0000 (13:43 -0800)]

ReplicatedPG: accept watch cookie value with notify ack

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Fri, 15 Feb 2013 18:45:47 +0000 (10:45 -0800)]

Watch/Notify: rework watch/notify

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Fri, 15 Feb 2013 18:43:45 +0000 (10:43 -0800)]

osd/: move ObjectContext over to osd_types.h

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Tue, 12 Feb 2013 21:08:36 +0000 (13:08 -0800)]

PG: check object_contexts on flushed

At FlushedEvt, all outstanding io should be complete and
the object_contexts map should be empty.

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Sat, 2 Feb 2013 00:42:47 +0000 (16:42 -0800)]

ReplicatedPG: add intrusive_ptr hooks

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Samuel Just [Sat, 2 Feb 2013 00:40:32 +0000 (16:40 -0800)]

Timer.cc: use complete() rather than finish()

Signed-off-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Sage Weil [Wed, 20 Feb 2013 21:25:58 +0000 (13:25 -0800)]

osd: remove force hack for testing the HASHPSPOOL code

Also from 8cc2b0f1243b2717af1de329a7fa6a8b5350db68.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Wed, 20 Feb 2013 20:47:38 +0000 (12:47 -0800)]

mon: allow syslog level and facility for cluster log to be controlled

Allow user to control the minimum level to go to syslog for the client-
and server-side submission paths for the cluster log, along with the syslog
'facility'. See syslog(3) man page.

Also move the level checks into a LogEntry method.

Closes: #3704
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Mon, 11 Jun 2012 14:13:05 +0000 (15:13 +0100)]

mon: Monitor: keyring always on mon_data/keyring by default

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Joao Eduardo Luis [Tue, 12 Jun 2012 22:51:10 +0000 (23:51 +0100)]

mon: MonitorDBStore: Add a key/value store to be used in the monitor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

commit | commitdiff | tree

Yehuda Sadeh [Wed, 20 Feb 2013 20:39:37 +0000 (12:39 -0800)]

rgw: refactor header grants

Move definition to a static array.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

commit | commitdiff | tree

caleb miles [Tue, 19 Feb 2013 17:15:30 +0000 (12:15 -0500)]

rgw_acl: Support ACL grants in headers.

Issue 3669: Support S3 ACL grants specified in request headers. Allow
requests, excluding POST object, to specify ACL grants in HTTP headers.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
Conflicts:
src/rgw/rgw_acl_s3.cc
src/rgw/rgw_acl_s3.h
src/rgw/rgw_rest_s3.cc
src/rgw/rgw_rest_s3.h

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

commit | commitdiff | tree

Sage Weil [Wed, 20 Feb 2013 18:37:01 +0000 (10:37 -0800)]

mon: fix new pool type

I broke this in 8cc2b0f1243b2717af1de329a7fa6a8b5350db68.

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Sage Weil [Wed, 20 Feb 2013 06:20:47 +0000 (22:20 -0800)]

osd: lock pg in build_past_intervals_parallel()

Methods called by write_if_dirty() (get_osdmap()) assert that the pg
is locked.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>

commit | commitdiff | tree

Sage Weil [Wed, 20 Feb 2013 16:44:03 +0000 (08:44 -0800)]

qa: mon/pool_ops.sh: fix last test

Got this one backwards, bah!

Signed-off-by: Sage Weil <sage@inktank.com>

commit | commitdiff | tree

Greg Farnum [Wed, 20 Feb 2013 02:00:44 +0000 (18:00 -0800)]

doc: make the cephfs man page marginally more truthful

Put it in the right place this time.

Signed-off-by: Greg Farnum <greg@inktank.com>

commit | commitdiff | tree

Yehuda Sadeh [Mon, 18 Feb 2013 17:10:43 +0000 (09:10 -0800)]

rgw: fix multipart uploads listing

Fixes: #4177
Backport: bobtail
Listing multipart uploads had a typo, and was requiring the
wrong resource (uploadId instead of uploads).

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

commit | commitdiff | tree

Yehuda Sadeh [Fri, 15 Feb 2013 18:22:54 +0000 (10:22 -0800)]

rgw: don't copy object when it's copied into itself

Fixes: #4150
Backport: bobtail

When an object is copied into itself, it will not be fully copied: the tail
reference count stays the same and the head part is rewritten.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
