Sage Weil [Thu, 7 Jun 2012 22:57:09 +0000 (15:57 -0700)]
crush: make magic numbers tunable
We have three magic numbers in crush_choose that are now tunable. The
first two control the local retry behavior, including fallback to a
permutation. The last is the total map descent attempts.
We can avoid a drastic incompatibility by making these tunable and encoded
in the map. That means users can enable/disable local retry, for example,
without changing the code. As long as the clients understand the tunables,
they can be adjusted.
This patch doesn't address the compatibility and feature bit issue. We may
want to roll that into a larger revision with more drastic changes, once
we know what those changes will look like. However, a careful user can
use the new code and modify the behavior.
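
A rough sketch of how three such tunables might bound a selection loop. This is not the crush_choose code: the field names, default values, and hashing scheme are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunables:
    # Hypothetical names and values, for illustration only.
    local_tries: int = 2           # retries after the first attempt at one level
    local_fallback_tries: int = 5  # permutation candidates to try; 0 disables it
    total_tries: int = 19          # full descents before giving up


def _hash(*parts):
    data = ",".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def choose(items, rejected, x, t):
    """Return (item, attempts); item is None if every attempt was rejected."""
    attempts = 0
    for descent in range(t.total_tries):
        # Local retry: rehash with a new retry counter.
        for r in range(t.local_tries + 1):
            attempts += 1
            cand = items[_hash(x, descent, r) % len(items)]
            if cand not in rejected:
                return cand, attempts
        # Fallback: walk a pseudo-random permutation so no candidate repeats.
        perm = sorted(items, key=lambda i: _hash(x, descent, "perm", i))
        for cand in perm[:t.local_fallback_tries]:
            attempts += 1
            if cand not in rejected:
                return cand, attempts
    return None, attempts
```

Setting local_tries and local_fallback_tries to 0 in this sketch disables local retry entirely, leaving only the total descent budget.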
Sage Weil [Wed, 6 Jun 2012 23:06:28 +0000 (16:06 -0700)]
assert: detect when /usr/include/assert.h clobbers us
The normal assert.h is very rude in that it clobbers any existing assert
define and replaces it with its own. And sadly, lots of things we include
include the generic version.
Be extra rude in response. Clobber any existing assert #define, and also
#define _ASSERT_H to be a magic value that our commonly-used dendl #define
depends on. This way we get a compile error if the system version replaces
our own.
This is imperfect, since we will only detect their rudeness when we use
the debug macros. I'm not coming up with something that is more widely
used that would work better, however.
Sage Weil [Wed, 6 Jun 2012 22:30:36 +0000 (15:30 -0700)]
keyserver: also authenticate against mon keyring
If we don't have a secret, also check in the extra_secrets keyring.
This means we can also authenticate as any user that appears in the mon
keyring, and get the caps defined there. This lets us bootstrap the
client.admin key with mon. key, provided mon 'allow *' caps appear in the
mon keyring.
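
The fallback lookup can be sketched as follows. The class and method names are hypothetical and the keyrings are modeled as plain dicts; only the lookup order comes from the message above.

```python
class KeyServer:
    """Toy model: look in our own secrets first, then in extra_secrets."""

    def __init__(self, secrets, extra_secrets):
        self.secrets = secrets              # name -> {"key": ..., "caps": ...}
        self.extra_secrets = extra_secrets  # e.g. entries from the mon keyring

    def get_secret(self, name):
        entry = self.secrets.get(name)
        if entry is None:
            # No secret of our own: also check the extra keyring.
            entry = self.extra_secrets.get(name)
        return entry  # None if the name is unknown in both


server = KeyServer(
    secrets={"client.foo": {"key": "k1", "caps": {"mon": "allow r"}}},
    extra_secrets={"mon.": {"key": "k2", "caps": {"mon": "allow *"}}},
)
```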
Sage Weil [Tue, 5 Jun 2012 23:15:43 +0000 (16:15 -0700)]
logclient: fix crashes, fix which entries are sent
I was seeing crashes when the monitor tried to send log entries.
* Send log entries that haven't already been sent.
* Don't try to be tricky with the deque; I'm paranoid about the stability
of the iterator.
* various asserts
* better variable names
Sage Weil [Tue, 5 Jun 2012 21:52:42 +0000 (14:52 -0700)]
monclient: send more log entries when first set is acked
Immediately send more log messages if we had more when the first set was
sent. Otherwise, wait until the next tick to check. This semi-throttles
logging based on how much the monitor can handle.
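
A minimal sketch of this ack-driven throttling, assuming a fixed batch size and a single batch in flight at a time. The class and its fields are hypothetical, not the MonClient interface.

```python
from collections import deque


class LogSender:
    def __init__(self, batch_size=2):
        self.queue = deque()
        self.batch_size = batch_size
        self.sent = []             # batches handed to the "monitor"
        self.in_flight = False
        self.more_pending = False  # were entries left over at send time?

    def log(self, entry):
        self.queue.append(entry)

    def _send(self):
        if self.in_flight or not self.queue:
            return
        n = min(self.batch_size, len(self.queue))
        self.sent.append([self.queue.popleft() for _ in range(n)])
        self.in_flight = True
        self.more_pending = bool(self.queue)

    def tick(self):
        # Periodic check: send if nothing is outstanding.
        self._send()

    def handle_ack(self):
        self.in_flight = False
        if self.more_pending:
            # More was queued when the last set went out: send right away.
            self._send()
        # Otherwise wait for the next tick.
```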
Sage Weil [Tue, 5 Jun 2012 21:51:17 +0000 (14:51 -0700)]
logclient: not a dispatcher
Let MonClient and Monitor handle delivery of messages. This puts them in
control and lets them trigger sending of more messages when we have a
bunch queued.
Samuel Just [Tue, 5 Jun 2012 16:58:38 +0000 (09:58 -0700)]
OSD: do not convert an entire collection in one transaction
Previously, we atomically moved the collection out of the way, created a
new collection, moved the contents of the old collection to the new
collection and removed the old collection. For large collections, this
could result in unacceptably long transactions. Now, we create a temp
collection, link all objects into it, atomically swap them, and then
remove the old one.
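
The new sequence can be sketched with a dict standing in for the store. Each appended step represents a separate small transaction; the function name and the ".tmp" suffix are assumptions for illustration.

```python
def convert_collection(store, name, steps):
    """Rebuild collection `name` via a temp collection, one small step at a time."""
    temp = name + ".tmp"  # hypothetical temp collection name
    store[temp] = {}
    steps.append("create " + temp)
    for oid in list(store[name]):
        # "Link": the temp collection refers to the same object payload.
        store[temp][oid] = store[name][oid]
        steps.append("link " + oid)
    # The only step that must be atomic: swap the two collections.
    store[name], store[temp] = store[temp], store[name]
    steps.append("swap")
    del store[temp]  # temp now names the old collection
    steps.append("remove old")
```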
Samuel Just [Fri, 1 Jun 2012 05:29:21 +0000 (22:29 -0700)]
FileStore,DBObjectMap: add SequencerPosition argument to ObjectMap
Previously, sequences like:
1. touch (c1, a)
2. link (c1, c2, a)
3. rm (c1, a)
4. setattr (c2, a)
5. clone (c2, a, b)
could result in the omap entries for a being removed once ops 2-3 are
replayed. Calls to ObjectMap::sync will include a sequencer position
and an hobject_t to mark. Updates to the object map will now also check
the SequencerPosition entry on the map header preventing replay of
earlier ops.
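
A simplified model of the replay guard, assuming a position is an ordered (seq, trans, op) triple stored on each object's header. The classes below are illustrative and do not mirror the DBObjectMap interface.

```python
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class SequencerPosition:
    # Assumed layout: compared lexicographically as (seq, trans, op).
    seq: int = 0
    trans: int = 0
    op: int = 0


@dataclass
class Header:
    spos: SequencerPosition = field(default_factory=SequencerPosition)
    omap: dict = field(default_factory=dict)


class ObjectMap:
    def __init__(self):
        self.headers = {}

    def _writable(self, oid, spos):
        h = self.headers.setdefault(oid, Header())
        # An op at or before the recorded position was already applied.
        return None if spos <= h.spos else h

    def set_keys(self, oid, kv, spos):
        h = self._writable(oid, spos)
        if h is None:
            return False
        h.omap.update(kv)
        h.spos = spos
        return True

    def clear(self, oid, spos):
        h = self._writable(oid, spos)
        if h is None:
            return False  # replay of an earlier op: skip it
        h.omap.clear()
        h.spos = spos
        return True
```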
Samuel Just [Wed, 30 May 2012 23:01:38 +0000 (16:01 -0700)]
OSD,FileStore: clean up filestore conversion
Previously, we messed with the filestore_update_collections config
option to enable upgrades in the filestore. We now pass that in as a
parameter to the FileStore,IndexManager constructors.
Further, the user must now specify the version to which to update in
order to prevent accidental updates.
Samuel Just [Wed, 30 May 2012 00:08:45 +0000 (17:08 -0700)]
ReplicatedPG: adjust missing at push_start
When we start receiving an object, we remove the old copy. This will
prevent the primary from using that old copy after that point.
We do the same on the pushee.
Samuel Just [Sat, 26 May 2012 02:18:41 +0000 (19:18 -0700)]
DBObjectMap: restructure for unique hobject_t's
Previously, the ObjectStore operated in terms of (coll_t,hobject_t)
tuples. Now that hobject_t's are globally unique within the
ObjectStore, it is no longer necessary to support multiple names for the
same DBObjectMap node.
Signed-off-by: Samuel Just <sam.just@dreamhost.com>
Samuel Just [Fri, 25 May 2012 22:06:55 +0000 (15:06 -0700)]
FileStore,DBObjectMap: remove ObjectMap link method
hobject_t's are now globally unique in filestore. Essentially, there is
a 1-to-1 mapping from inodes to hobject_t's. The entry in the
DBObjectMap is now tied to the inode/hobject_t. Thus, links needn't be
tracked. Rather, we delete the ObjectMap entry when nlink == 0.
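
The nlink rule can be modeled with a toy store in which the omap entry is keyed by object alone and dropped when the last link goes away. All names here are hypothetical.

```python
class ToyStore:
    def __init__(self):
        self.links = set()  # (collection, oid) pairs
        self.omap = {}      # oid -> key/value dict, independent of collection

    def nlink(self, oid):
        return sum(1 for (_, o) in self.links if o == oid)

    def touch(self, coll, oid):
        self.links.add((coll, oid))
        self.omap.setdefault(oid, {})

    def link(self, src, dst, oid):
        assert (src, oid) in self.links
        self.links.add((dst, oid))  # no omap bookkeeping needed

    def remove(self, coll, oid):
        self.links.discard((coll, oid))
        if self.nlink(oid) == 0:
            # Last name is gone: only now delete the omap entry.
            self.omap.pop(oid, None)
```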
Samuel Just [Thu, 24 May 2012 17:57:22 +0000 (10:57 -0700)]
src/: Add namespace and pool fields to hobject_t
From this point, hobjects in the ObjectStore will be globally unique. This
will allow us to avoid including the collection in the ObjectMap key encoding
and thereby enable efficient collection renames and, eventually, collection
splits.
Greg Farnum [Fri, 1 Jun 2012 23:46:39 +0000 (16:46 -0700)]
msg: make clear_pipe work only on a given Pipe, rather than the current one.
This way old Pipes that have been replaced can't clear the new Pipe
out of a Connection's link.
We might attempt to instead sever the link between CLOSED Pipes and
their Connections more completely (eg, when the Connection gets a
new Pipe), but that will require more work to handle all the
cases, and this works for now.
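
The identity check can be illustrated like this. The classes are stand-ins with hypothetical method names, not the messenger types themselves.

```python
class Pipe:
    def __init__(self, name):
        self.name = name


class Connection:
    def __init__(self):
        self.pipe = None

    def reset_pipe(self, pipe):
        self.pipe = pipe

    def clear_pipe(self, old_pipe):
        # Only clear the link if it still points at the Pipe asking for it.
        if self.pipe is old_pipe:
            self.pipe = None


con = Connection()
old, new = Pipe("old"), Pipe("new")
con.reset_pipe(old)
con.reset_pipe(new)  # old Pipe has been replaced
con.clear_pipe(old)  # stale Pipe shuts down: the new link must survive
```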
Sage Weil [Sat, 2 Jun 2012 22:19:28 +0000 (15:19 -0700)]
upstart: simplify start; allow group stop via an abstract job
Use a 'ceph-mds' or 'ceph-mon' event to start instances instead of
explicitly calling start. This avoids the ugly is-this-already-running
check. [Thanks Guilhem Lettron for that!]
Make the -all job abstract (which means it stays started and can be
stopped). Trigger a helper task (-all-starter) to trigger instance
start. Make instances stop with the -all task. This allows you to do