git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
13 years agoosd: simplify finalizing scrub on replica
Sage Weil [Fri, 21 Oct 2011 16:56:19 +0000 (09:56 -0700)]
osd: simplify finalizing scrub on replica

We can simply call osr.flush() (with pg lock held) to ensure that prior
writes are visible and scrubbable.  This avoids the funky handoff to
op_applied() (which didn't seem to work for me just now, although I didn't
fully debug).

In any case, this is much simpler.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PriorSet: acting/up membership implies still alive
Sage Weil [Fri, 21 Oct 2011 16:14:15 +0000 (09:14 -0700)]
osd: PriorSet: acting/up membership implies still alive

If the osd is in the acting or up sets, we can assume they are still alive,
even though we don't know that for sure, because if they are not, we will
rebuild PriorSet.

Note that we have a dependency here on up_thru that we could/should rebuild
PriorSet based on, IF we think it might change the value of the CRASHED
flag and IF we care enough.  Right now we don't.  Marking CRASHED when we
don't need to is conservative, and not dangerous.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoMerge remote branch 'gh/master' into wip-prior
Sage Weil [Fri, 21 Oct 2011 15:58:43 +0000 (08:58 -0700)]
Merge remote branch 'gh/master' into wip-prior

Conflicts:
src/osd/PG.cc

13 years agoOSDMonitor: reweight towards average utilization
Josh Durgin [Fri, 21 Oct 2011 00:13:21 +0000 (17:13 -0700)]
OSDMonitor: reweight towards average utilization

The existing reweight-by-utilization calculation did not take into
account the current weight of an OSD, and depended in part on the
threshold given by the user. Also send the user both the old and new
weights.

Fixes: #1636
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
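
The commit message above does not spell out the new calculation. As a rough sketch only, assuming the new weight is the current weight scaled by the ratio of average utilization to the OSD's own utilization, and that the threshold merely selects which OSDs are touched, the idea might look like this (reweight_toward_average and its parameters are hypothetical names, not the OSDMonitor code):

```cpp
#include <cassert>

// Scale an OSD's current weight toward the cluster-average utilization.
// Assumption: only OSDs above `threshold` (e.g. 1.2 == 120% of the average)
// are changed; everything else keeps its current weight.
inline double reweight_toward_average(double old_weight, double util,
                                      double average_util, double threshold) {
  assert(util >= 0.0 && average_util > 0.0);
  if (util <= average_util * threshold)
    return old_weight;                        // not overloaded: leave it alone
  return old_weight * (average_util / util);  // overloaded: shrink proportionally
}
```

Because the result is derived from the old weight, repeated runs converge instead of resetting the weight each time, which appears to be the point of the fix.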
13 years agoosd: PG::PriorSet: make debug_pg arg const
Sage Weil [Thu, 20 Oct 2011 22:56:15 +0000 (15:56 -0700)]
osd: PG::PriorSet: make debug_pg arg const

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet -> PriorSet
Sage Weil [Thu, 20 Oct 2011 22:51:10 +0000 (15:51 -0700)]
osd: PgPriorSet -> PriorSet

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: rename prior_set_affected -> affected_by_map
Sage Weil [Thu, 20 Oct 2011 22:50:33 +0000 (15:50 -0700)]
osd: PgPriorSet: rename prior_set_affected -> affected_by_map

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: remove obsolete comment
Sage Weil [Thu, 20 Oct 2011 22:47:54 +0000 (15:47 -0700)]
osd: PgPriorSet: remove obsolete comment

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: move prior_set_affected into PgPriorSet
Sage Weil [Thu, 20 Oct 2011 22:46:43 +0000 (15:46 -0700)]
osd: PgPriorSet: move prior_set_affected into PgPriorSet

This is really where it belongs.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: kill whoami; make PG arg strictly optional
Sage Weil [Thu, 20 Oct 2011 22:46:11 +0000 (15:46 -0700)]
osd: PgPriorSet: kill whoami; make PG arg strictly optional

It is only used for the debug output prefix.  Make it so we can leave it
out entirely (e.g. for unit tests).

We don't want to, say, pass in the string prefix itself, or else we are
stuck with generating that string even on low debug levels where it won't
be used.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoMerge branch 'stable'
Sage Weil [Thu, 20 Oct 2011 21:12:40 +0000 (14:12 -0700)]
Merge branch 'stable'

13 years agoosd: fix requeue_ops
Sage Weil [Thu, 20 Oct 2011 21:11:20 +0000 (14:11 -0700)]
osd: fix requeue_ops

The ls argument passed to requeue_ops() is a reference, and one of the
methods we call (say, _handle_op) might want to requeue the message on the
same list we were passed, leading to an infinite loop.

Set ls contents aside to avoid that.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
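
The fix described above can be sketched as follows. This is a minimal illustration, not the OSD code: Op and the handler callback are hypothetical stand-ins, and the only point is that the caller's list is emptied into a local list before iteration, so a handler that requeues onto the same list cannot extend the loop.

```cpp
#include <cassert>
#include <functional>
#include <list>

// Hypothetical stand-in for a queued message; not a Ceph type.
using Op = int;

// Hand every op in `ls` to `handle`. The handler may push ops back onto the
// very list the caller passed in, so move the contents aside first and
// iterate over the private copy. Returns how many ops were handled.
inline int requeue_ops(std::list<Op>& ls,
                       const std::function<void(Op)>& handle) {
  std::list<Op> pending;
  pending.swap(ls);  // ls is now empty; handlers are free to refill it
  int handled = 0;
  for (Op op : pending) {
    handle(op);
    ++handled;
  }
  return handled;
}
```

Iterating over ls directly while a handler appends to it would never reach the end of the list; with the swap, anything requeued simply waits in ls for the next call.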
13 years agoperfcounters: remove dout
Sage Weil [Thu, 20 Oct 2011 20:59:12 +0000 (13:59 -0700)]
perfcounters: remove dout

We can't use this because we're part of libglobal and there is no
g_ceph_context.  And I'm too lazy to use cct.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoperfcounters: fix unit test
Sage Weil [Thu, 20 Oct 2011 20:58:14 +0000 (13:58 -0700)]
perfcounters: fix unit test

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoMerge remote branch 'gh/wip-unfound'
Sage Weil [Thu, 20 Oct 2011 20:48:44 +0000 (13:48 -0700)]
Merge remote branch 'gh/wip-unfound'

13 years agofilestore: measure commit interval, latency, journal full count
Sage Weil [Thu, 20 Oct 2011 20:16:28 +0000 (13:16 -0700)]
filestore: measure commit interval, latency, journal full count

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoosd: clean up perfcounter names
Sage Weil [Thu, 20 Oct 2011 19:45:34 +0000 (12:45 -0700)]
osd: clean up perfcounter names

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agofilestore: simplify, clean up perfcounters
Sage Weil [Thu, 20 Oct 2011 19:43:20 +0000 (12:43 -0700)]
filestore: simplify, clean up perfcounters

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agofilestore: simplify perfcounter lifecycle
Sage Weil [Thu, 20 Oct 2011 19:00:28 +0000 (12:00 -0700)]
filestore: simplify perfcounter lifecycle

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoperfcounters: fix addition/removal
Sage Weil [Thu, 20 Oct 2011 18:34:23 +0000 (11:34 -0700)]
perfcounters: fix addition/removal

We are not responsible for deleting removed perfcounters.

Add debugging.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agofilestore: fix perfcounter definition
Sage Weil [Thu, 20 Oct 2011 18:33:38 +0000 (11:33 -0700)]
filestore: fix perfcounter definition

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agofilestore: fix logger start
Sage Weil [Thu, 20 Oct 2011 17:59:03 +0000 (10:59 -0700)]
filestore: fix logger start

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoperfcounters: use simple names
Sage Weil [Thu, 20 Oct 2011 16:19:45 +0000 (09:19 -0700)]
perfcounters: use simple names

We don't need to uniquely identify ourselves in the global namespace with
the PerfCounter name, only in the current process.  Collectd will handle
the per-daemon naming part.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoperfcounters: clean up interface a bit
Sage Weil [Thu, 20 Oct 2011 16:17:17 +0000 (09:17 -0700)]
perfcounters: clean up interface a bit

No logger_ prefix necessary.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoassert: no 0x before thread id
Sage Weil [Sun, 16 Oct 2011 17:36:31 +0000 (10:36 -0700)]
assert: no 0x before thread id

There's no 0x prefix in the log lines either.  This makes it easier to
copy/paste word and search.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years ago.gitignore: add test_filestore_idempotent
Josh Durgin [Wed, 19 Oct 2011 22:35:56 +0000 (15:35 -0700)]
.gitignore: add test_filestore_idempotent

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
13 years agotest_filestore_idempotent: initialize var
Josh Durgin [Wed, 19 Oct 2011 22:35:28 +0000 (15:35 -0700)]
test_filestore_idempotent: initialize var

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
13 years agoMerge branch 'stable'
Sage Weil [Wed, 19 Oct 2011 16:14:24 +0000 (09:14 -0700)]
Merge branch 'stable'

Conflicts:
src/mon/OSDMonitor.cc
src/osd/OSD.cc

13 years agotest_filestore_idempotent: simple tool to generate a workload of non-idempotent opera...
Sage Weil [Tue, 18 Oct 2011 21:12:06 +0000 (14:12 -0700)]
test_filestore_idempotent: simple tool to generate a workload of non-idempotent operations

Generate a workload of operations that are non-idempotent.  These are:

 transaction {
   clone A -> A.($n-1)
   write $n to A
 }
 $n++
 loop!

If we apply any transaction to the file system more than once, we will
find that the A.$n object does not contain $n, but instead contains
some larger value.

First run in 'write' mode to generate a workload and fake a crash.
Then run in 'verify' mode to see if the result was bad.

Signed-off-by: Sage Weil <sage@newdream.net>
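
To see why replaying one of these transactions is detectable, here is a toy model of the workload. It uses an in-memory map instead of a real object store, and Store, apply_txn and count_bad_clones are hypothetical names for illustration, not the tool's actual code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy in-memory object store; a hypothetical stand-in for the FileStore.
// Object "A" starts out holding 0 (the map's default value).
using Store = std::map<std::string, int>;

// One transaction of the workload: clone A -> A.(n-1), then write n to A.
inline void apply_txn(Store& s, int n) {
  s["A." + std::to_string(n - 1)] = s["A"];
  s["A"] = n;
}

// Verify pass: after transactions 1..last_n, every clone A.k (k < last_n)
// must hold exactly k. Returns the number of clones that do not.
inline int count_bad_clones(const Store& s, int last_n) {
  int bad = 0;
  for (int k = 0; k < last_n; ++k) {
    auto it = s.find("A." + std::to_string(k));
    if (it == s.end() || it->second != k)
      ++bad;
  }
  return bad;
}
```

Applying transaction n a second time clones the already-updated A, so A.(n-1) ends up holding n rather than n-1, which is the "larger value" the verify mode looks for.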
13 years agofilestore: tolerate EEXIST on mkcoll when not-btrfs
Sage Weil [Tue, 18 Oct 2011 21:13:11 +0000 (14:13 -0700)]
filestore: tolerate EEXIST on mkcoll when not-btrfs

For non-btrfs file systems we should tolerate EEXIST because we may
replay the event more than once.

Signed-off-by: Sage Weil <sage@newdream.net>
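
A minimal sketch of the tolerance described above, assuming a collection maps to a directory created with POSIX mkdir(). The function name and the replay_may_repeat flag are hypothetical; the flag stands for "this backend may replay the same event more than once" (the non-btrfs case).

```cpp
#include <cassert>
#include <cerrno>
#include <sys/stat.h>
#include <sys/types.h>

// Create a collection directory. Returns 0 on success or a negative errno.
// When replay_may_repeat is true, an already existing directory is treated
// as success because an earlier replay of the same event may have made it.
inline int make_collection(const char* path, bool replay_may_repeat) {
  if (::mkdir(path, 0755) == 0)
    return 0;
  int err = errno;
  if (err == EEXIST && replay_may_repeat)
    return 0;  // created by a previous replay; nothing left to do
  return -err;
}
```

With the flag off, the same call still reports -EEXIST, so a backend that never replays twice keeps its stricter check.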
13 years agomds: handle xattrs on inode creation
Sage Weil [Tue, 18 Oct 2011 20:38:21 +0000 (13:38 -0700)]
mds: handle xattrs on inode creation

Allow mknod, mkdir, symlink, create to provide xattrs for the new
inode.  This will be used by the kclient to set ACLs on new inodes
based on the parent directory.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoradosgw-admin: fix conflict with KeyType in libnss
Sage Weil [Tue, 18 Oct 2011 23:04:22 +0000 (16:04 -0700)]
radosgw-admin: fix conflict with KeyType in libnss

rgw/rgw_admin.cc:459:6: error: using typedef-name 'KeyType' after 'enum'
/usr/include/nss3/keythi.h:69:3: error: 'KeyType' has a previous declaration here

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: cur -> probe
Sage Weil [Tue, 18 Oct 2011 18:42:16 +0000 (11:42 -0700)]
osd: PgPriorSet: cur -> probe

Rename cur to probe, the set of OSDs we need to probe in order to
successfully peer.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: restructure lost checks for prior set
Sage Weil [Tue, 18 Oct 2011 18:40:06 +0000 (11:40 -0700)]
osd: PgPriorSet: restructure lost checks for prior set

When we add down osds to the cur set, we block peering because there
are OSDs that may have data we need and they are not currently up.
When that happens, marking those OSDs as lost may allow peering to
proceed.

Keep an explicit map blocked_by for exactly that set of OSDs (a subset
of cur), and compare lost_by values in prior_set_affected() to that.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agorgw: workqueue suicide timeout is infinity
Yehuda Sadeh [Tue, 18 Oct 2011 18:01:43 +0000 (11:01 -0700)]
rgw: workqueue suicide timeout is infinity

13 years agoosd: PgPriorSet: simplify (and change) CRASHED logic
Sage Weil [Tue, 18 Oct 2011 01:59:10 +0000 (18:59 -0700)]
osd: PgPriorSet: simplify (and change) CRASHED logic

Any single OSD from a given interval surviving is sufficient to ensure
that an ACKed write during that interval was committed to disk.

Currently, at least.

In any case, update the prior set calculation to reflect that.  Also
make the survival conditional a bit smarter, to include both last_clean
interval (from the OSD's previous interval of up-ness) as well as the
current interval [up_from..up_thru).

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: update comment terms a bit
Sage Weil [Tue, 18 Oct 2011 01:57:16 +0000 (18:57 -0700)]
osd: PgPriorSet: update comment terms a bit

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: do not short-cut up_thru update for new PGs
Sage Weil [Tue, 18 Oct 2011 00:51:53 +0000 (17:51 -0700)]
osd: do not short-cut up_thru update for new PGs

Commit e731885d2550ee985bf875ab5bb5faf28f1693eb made it possible for
a new PG to go active without forcing the OSD's up_thru to update.
This was motivated by the desire for PG creation by radosgw to go
faster.  Radosgw no longer creates a pool per bucket, so this is not
useful there, and it is unclear what other application (that is not
abusing rados pools) would need it.

Since it complicates the prior set calculation for dubious reasons,
let's revert it.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: remove unused PG member
Sage Weil [Tue, 18 Oct 2011 00:43:06 +0000 (17:43 -0700)]
osd: PgPriorSet: remove unused PG member

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: clean up comments a bit
Sage Weil [Tue, 18 Oct 2011 00:42:48 +0000 (17:42 -0700)]
osd: PgPriorSet: clean up comments a bit

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: clean up per-interval var names
Sage Weil [Tue, 18 Oct 2011 00:39:47 +0000 (17:39 -0700)]
osd: PgPriorSet: clean up per-interval var names

We don't actually use any_lost_now, but it makes the logic easier
to understand to have it there.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: revert start_since_joining check
Sage Weil [Tue, 18 Oct 2011 00:44:32 +0000 (17:44 -0700)]
osd: PgPriorSet: revert start_since_joining check

Commit 5b78f5db8c200edcc949033e1badae70fecd2e08 added a check to
prevent some sort of badness when osds were marked lost, but I can't
figure out what it was.  Remove the check for now until we can
reproduce/observe the badness in practice, and then write a test and
better motivated/documented fix.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: do not include UP osds in prior.cur
Sage Weil [Tue, 18 Oct 2011 00:02:23 +0000 (17:02 -0700)]
osd: PgPriorSet: do not include UP osds in prior.cur

The up osds are not (directly) relevant since they are not necessarily
members of the PG.  We only care about acting OSDs, which may have
committed writes to the PG during this past interval.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: remove up_thru crap
Sage Weil [Mon, 17 Oct 2011 23:54:03 +0000 (16:54 -0700)]
osd: PgPriorSet: remove up_thru crap

This was added way back in 1cf9bebc8e5063f5f311d33e7735bcc9286e98ce,
but as far as I can tell it didn't make any sense then either.

The issue: we redo the prior set calculation if the up_thru for these
OSDs changes in the current map, but the prior_set result does not
depend on the current map's up_thru values in any way; it only depends
on the up_thru in the last epoch of each past interval, and that is
fixed in the past.

Remove the cruft.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: PgPriorSet: any_survived -> any_is_alive_now
Sage Weil [Mon, 17 Oct 2011 23:48:55 +0000 (16:48 -0700)]
osd: PgPriorSet: any_survived -> any_is_alive_now

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agodoc: Change diagram to have radosgw closer to direct rados access.
Tommi Virtanen [Mon, 17 Oct 2011 23:13:14 +0000 (16:13 -0700)]
doc: Change diagram to have radosgw closer to direct rados access.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
13 years agostreamtest: do mkfs
Sage Weil [Mon, 17 Oct 2011 21:13:40 +0000 (14:13 -0700)]
streamtest: do mkfs

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agostreamtest: print to stdout
Sage Weil [Mon, 17 Oct 2011 21:12:48 +0000 (14:12 -0700)]
streamtest: print to stdout

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agomkcephfs: copy ceph.conf to /etc/ceph/ceph.conf (when -a)
Sage Weil [Mon, 17 Oct 2011 17:49:46 +0000 (10:49 -0700)]
mkcephfs: copy ceph.conf to /etc/ceph/ceph.conf (when -a)

You can disable this with --no-copy-conf.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoceph.spec: don't chkconfig
Sage Weil [Mon, 17 Oct 2011 15:51:47 +0000 (08:51 -0700)]
ceph.spec: don't chkconfig

This was fighting with suse insserv.  Still needs some cleanup.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoceph.spec: work around build.opensuse.org
Sage Weil [Mon, 17 Oct 2011 15:50:54 +0000 (08:50 -0700)]
ceph.spec: work around build.opensuse.org

The redhat-rpm-config isn't installed on build.opensuse.org, which means
the processor is set to i386 instead of something less ancient.  This
breaks compilation on 32-bit x86.

Kludge around it here.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoceph.spec: capitalize first letter to make rpmlint happy
Sage Weil [Mon, 17 Oct 2011 15:49:04 +0000 (08:49 -0700)]
ceph.spec: capitalize first letter to make rpmlint happy

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agov0.37 v0.37
Sage Weil [Mon, 17 Oct 2011 15:35:57 +0000 (08:35 -0700)]
v0.37

13 years agoosd: fix assemble_backlog
Sage Weil [Sun, 16 Oct 2011 23:07:12 +0000 (16:07 -0700)]
osd: fix assemble_backlog

This was written assuming that le->prior_version wouldn't be the version
that we have locally on disk.  Not always true!

If it is the same, then we can just keep the entry (and clear reqid).  If
it is different, keep the behavior we had (re-add, erase current).

FWIW the last time this was touched was 916b1998.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: fix add_next_event Missing::item::have
Sage Weil [Mon, 17 Oct 2011 03:37:28 +0000 (20:37 -0700)]
osd: fix add_next_event Missing::item::have

The missing set should be accurate up to the current point in the log.  The
log_tail has no bearing on that, nor does last_update, since we're
processing new events in forward order, and updating missing as we go.

Drops the now unused info argument... :/

This more or less reverts b418896d.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoceph: don't crash when sending message to !up osd
Sage Weil [Sat, 15 Oct 2011 05:56:06 +0000 (22:56 -0700)]
ceph: don't crash when sending message to !up osd

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: pull old version to revert to
Sage Weil [Thu, 13 Oct 2011 23:28:53 +0000 (16:28 -0700)]
osd: pull old version to revert to

If we are the primary, and are doing a LOST_REVERT, pull the old version
of the object and update the version when we get it.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: implement lost_revert
Sage Weil [Thu, 13 Oct 2011 20:03:09 +0000 (13:03 -0700)]
osd: implement lost_revert

Roll back to the last available version of an object.  If there is no
available version, delete it.

Leave the door open for other approaches later.

Currently this only works if the prior version is on the primary.  If it is
on another node, we don't pull it yet.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: simplify share_pg_log
Sage Weil [Thu, 13 Oct 2011 19:57:54 +0000 (12:57 -0700)]
osd: simplify share_pg_log

Use Log::copy_after().  Drop the useless argument.  Strip out the broken
LOST logic.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: fix up PG::Missing methods a bit
Sage Weil [Thu, 13 Oct 2011 19:56:28 +0000 (12:56 -0700)]
osd: fix up PG::Missing methods a bit

Pass in iterators when possible.  Stack methods instead of duplicating
functionality.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: factor out recover_primary_got() helper
Sage Weil [Thu, 13 Oct 2011 19:53:59 +0000 (12:53 -0700)]
osd: factor out recover_primary_got() helper

This handles the missing set and last_complete adjustment when we recover
an object on the primary.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: make C_OSD_CommittedPushedObject::op optional
Sage Weil [Thu, 13 Oct 2011 19:52:13 +0000 (12:52 -0700)]
osd: make C_OSD_CommittedPushedObject::op optional

This lets us reuse this helper for committing recovery ops that aren't a
result of a push.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: pass version explicitly to pull
Sage Weil [Thu, 13 Oct 2011 19:51:08 +0000 (12:51 -0700)]
osd: pass version explicitly to pull

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: fix share_pg_log()
Sage Weil [Thu, 13 Oct 2011 19:48:41 +0000 (12:48 -0700)]
osd: fix share_pg_log()

We need to handle a log message in the ReplicaActive state.  And set the
epoch properly when we send it.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agomessages/MOSDPG*: clean up output a bit
Sage Weil [Thu, 13 Oct 2011 19:42:47 +0000 (12:42 -0700)]
messages/MOSDPG*: clean up output a bit

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: remove superfluous write_info calls
Sage Weil [Thu, 13 Oct 2011 18:51:41 +0000 (11:51 -0700)]
osd: remove superfluous write_info calls

- merge_log() will write_info (and log) as needed
- Activate() will do the same

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: all_unfound_are_queried_or_lost
Sage Weil [Thu, 13 Oct 2011 19:47:57 +0000 (12:47 -0700)]
osd: all_unfound_are_queried_or_lost

The check to make isn't whether all locations are lost, but whether all
locations are either lost or have been queried and don't have the object(s)
we want.

Signed-off-by: Sage Weil <sage@newdream.net>
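
The predicate described above might be sketched like this. Location and its fields are hypothetical simplifications, not the real PG data structures; the sketch only shows the shape of the check.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-location state for one possible source of an object.
struct Location {
  bool lost;        // the OSD has been marked lost
  bool queried;     // we asked this OSD what it has
  bool has_object;  // it reported having the object we want
};

// True when every location is either lost, or has been queried and turned
// out not to have the object. Only then is the object truly unfound.
inline bool all_unfound_are_queried_or_lost(const std::vector<Location>& locs) {
  return std::all_of(locs.begin(), locs.end(), [](const Location& l) {
    return l.lost || (l.queried && !l.has_object);
  });
}
```

A location that is up but has not been queried yet makes the check fail, which is the case the older "all locations are lost" test got wrong.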
13 years agoosd: adjust LOST log entry types; simplify log entry type strings
Sage Weil [Wed, 12 Oct 2011 16:20:54 +0000 (09:20 -0700)]
osd: adjust LOST log entry types; simplify log entry type strings

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: fix up mark_all_unfound_lost so that it actually works
Sage Weil [Thu, 6 Oct 2011 04:30:47 +0000 (21:30 -0700)]
osd: fix up mark_all_unfound_lost so that it actually works

Well, it works given our weak definition of LOST.

- use ObjectContexts properly
- move into ReplicatedPG
- no need for _as_ in name

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoosd: implement 'flush_pg_stats' command
Sage Weil [Sat, 15 Oct 2011 03:20:11 +0000 (20:20 -0700)]
osd: implement 'flush_pg_stats' command

This flushes the current pg stats to the monitor, and blocks until the
monitor commits it.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: process commands in a workqueue
Sage Weil [Sat, 15 Oct 2011 03:19:38 +0000 (20:19 -0700)]
osd: process commands in a workqueue

This lets us do commands that can potentially block.  For example:

 - flush pg stats to osd
 - request (and wait for) latest osdmap

Currently the threadpool only has 1 thread.  i.e., one concurrent command.
That should be fine, methinks.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agomon: feed MPGStats tids back through the MPGStatsAck
Sage Weil [Fri, 14 Oct 2011 20:55:57 +0000 (13:55 -0700)]
mon: feed MPGStats tids back through the MPGStatsAck

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: remove some pg stats debug cruft
Sage Weil [Fri, 14 Oct 2011 20:44:43 +0000 (13:44 -0700)]
osd: remove some pg stats debug cruft

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: handle (and reply to) direct MCommands
Sage Weil [Wed, 12 Oct 2011 17:44:19 +0000 (10:44 -0700)]
osd: handle (and reply to) direct MCommands

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agocephtool: ability to send commands directly to osds
Sage Weil [Wed, 12 Oct 2011 17:38:23 +0000 (10:38 -0700)]
cephtool: ability to send commands directly to osds

This makes commands beginning with 'tell <target>' magic in that they go
to the given target instead of to the monitor.  This is slightly odd, but
I think it gives the most natural interface for the user, with the tool
Doing The Right Thing for you.  E.g.,

 ceph tell <someone> something      (direct to some daemon)
 ceph do something                  (goes to monitor to do X)

Signed-off-by: Sage Weil <sage@newdream.net>
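
The routing rule described above can be illustrated with a small parser. This is a sketch under assumptions, not the ceph tool's code: Route and route_command are hypothetical names, and the target is simply taken to be the word after 'tell'.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Where a command line should be sent.
struct Route {
  bool to_monitor;      // true: send to the monitor as before
  std::string target;   // daemon name when to_monitor is false
  std::string command;  // command text to deliver
};

// Commands beginning with 'tell <target>' go to that target; everything
// else goes to the monitor unchanged.
inline Route route_command(const std::string& line) {
  std::istringstream in(line);
  std::string first;
  in >> first;
  if (first != "tell")
    return Route{true, "", line};
  Route r{false, "", ""};
  in >> r.target;
  std::getline(in >> std::ws, r.command);  // rest of the line, if any
  return r;
}
```

For example, "tell osd.0 flush_pg_stats" would be routed to osd.0 with the command "flush_pg_stats", while a line that does not start with 'tell' is passed to the monitor as-is.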
13 years agomsg: entity_name_t::parse()
Sage Weil [Wed, 12 Oct 2011 03:57:24 +0000 (20:57 -0700)]
msg: entity_name_t::parse()

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agomsg: add MCommand, MCommandReply message types
Sage Weil [Wed, 12 Oct 2011 00:51:07 +0000 (17:51 -0700)]
msg: add MCommand, MCommandReply message types

These are similar to MMonCommand[Ack], but aren't PaxosServiceMessage
children, don't include the command in the reply (useless), and have a more
generic name.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agopaxos: trim extra state dirs
Sage Weil [Fri, 14 Oct 2011 04:26:13 +0000 (21:26 -0700)]
paxos: trim extra state dirs

OSDMonitor, for instance, stores both an "osdmap" and "osdmap_full" for
each state.  Trim them both.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoPG: call set_last_peering_reset in Started constructor
Samuel Just [Fri, 14 Oct 2011 19:59:42 +0000 (12:59 -0700)]
PG: call set_last_peering_reset in Started constructor

Calling it here should cover all possible replica and primary peering
resets.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
13 years agofilestore: assert on any unexpected error
Sage Weil [Fri, 14 Oct 2011 23:49:28 +0000 (16:49 -0700)]
filestore: assert on any unexpected error

Right now, the only errors we expect out of the underlying filesystem are
-ENOENT, -ENODATA, or (as a workaround for extN xattr suckage) -ENOSPC
for certain setxattr operations.

Signed-off-by: Sage Weil <sage@newdream.net>
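
As a sketch of the policy above, the set of tolerated results could be expressed as a single predicate that callers assert on. The function name and the is_setxattr flag are hypothetical; the commit only says that -ENOSPC is tolerated "for certain setxattr operations".

```cpp
#include <cassert>
#include <cerrno>

// Returns true if `r` (0 or positive for success, otherwise a negative
// errno) is a result we are prepared to handle. Anything else is a bug or
// a broken filesystem, and the caller is expected to assert.
inline bool is_expected_result(int r, bool is_setxattr) {
  if (r >= 0)
    return true;
  if (r == -ENOENT || r == -ENODATA)
    return true;
  if (r == -ENOSPC && is_setxattr)
    return true;  // workaround for xattr size limits on extN
  return false;
}
```

A caller would then write something like assert(is_expected_result(r, false)) right after the filesystem call, so that an unexpected error such as -EIO stops the daemon instead of being silently ignored.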
13 years agoosd: send full map if we don't have sufficiently old incremental
Sage Weil [Fri, 14 Oct 2011 20:28:51 +0000 (13:28 -0700)]
osd: send full map if we don't have sufficiently old incremental

If the peer has a really old map, send a full map instead of crashing
because we are missing the needed incremental.

Signed-off-by: Sage Weil <sage@newdream.net>
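
The decision described above might look roughly like this. It is a sketch under the assumption that catching a peer up from epoch e needs the incrementals for e+1 onward; MapSend, choose_map_send and oldest_incremental are hypothetical names.

```cpp
#include <cassert>

enum class MapSend { Incremental, Full };

// peer_epoch:         the newest map epoch the peer already has
// oldest_incremental: the oldest incremental map we still store
// If the first incremental the peer needs has already been trimmed, fall
// back to sending a full map instead of failing.
inline MapSend choose_map_send(unsigned peer_epoch,
                               unsigned oldest_incremental) {
  unsigned first_needed = peer_epoch + 1;
  return first_needed >= oldest_incremental ? MapSend::Incremental
                                            : MapSend::Full;
}
```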
13 years agomon: make number of old paxos states configurable
Sage Weil [Fri, 14 Oct 2011 20:30:43 +0000 (13:30 -0700)]
mon: make number of old paxos states configurable

Currently settable on osdmaps, pgmaps, and log.  Still need MDSMap and
authmap trimming.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: share oldest_map info with peers
Sage Weil [Fri, 14 Oct 2011 20:29:33 +0000 (13:29 -0700)]
osd: share oldest_map info with peers

This helps OSDs trim their old maps even when they don't get MOSDMap
messages directly from the monitor.

It also feeds that information to clients, although they don't use it yet.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoosd: send full map if we don't have sufficiently old incremental
Sage Weil [Fri, 14 Oct 2011 20:28:51 +0000 (13:28 -0700)]
osd: send full map if we don't have sufficiently old incremental

If the peer has a really old map, send a full map instead of crashing
because we are missing the needed incremental.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agopaxos: trim extra state dirs
Sage Weil [Fri, 14 Oct 2011 04:26:13 +0000 (21:26 -0700)]
paxos: trim extra state dirs

OSDMonitor, for instance, stores both an "osdmap" and "osdmap_full" for
each state.  Trim them both.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoosd: remove old osdmaps
Sage Weil [Fri, 14 Oct 2011 04:14:09 +0000 (21:14 -0700)]
osd: remove old osdmaps

When the monitor removes old maps, we should too.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoPG: call set_last_peering_reset in Started constructor
Samuel Just [Fri, 14 Oct 2011 19:59:42 +0000 (12:59 -0700)]
PG: call set_last_peering_reset in Started constructor

Calling it here should cover all possible replica and primary peering
resets.

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
13 years agoPG: Fix log.empty confusion
Samuel Just [Thu, 13 Oct 2011 21:35:13 +0000 (14:35 -0700)]
PG: Fix log.empty confusion

Previously, log.empty meant that the log.head was eversion_t().  However,
it was in a few places used to mean that log.head == log.tail.  Now,
log.empty means log.head == log.tail and log.null() indicates that
log.head is eversion_t().

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
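
The distinction between the two predicates can be sketched with simplified types. Version and Log below are hypothetical stand-ins for eversion_t and the PG log, kept just detailed enough to show the two definitions side by side.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for eversion_t: default-constructed means "no version".
struct Version {
  uint64_t epoch;
  uint64_t version;
  Version() : epoch(0), version(0) {}
  Version(uint64_t e, uint64_t v) : epoch(e), version(v) {}
  bool operator==(const Version& o) const {
    return epoch == o.epoch && version == o.version;
  }
};

// Simplified stand-in for the PG log bounds.
struct Log {
  Version head;  // newest version covered by the log
  Version tail;  // lower bound of the log
  bool empty() const { return head == tail; }      // holds no entries
  bool null() const { return head == Version(); }  // head was never set
};
```

A log whose head and tail are both (3, 10) is empty but not null, which is exactly the case the old single definition could not express.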
13 years agomakefile changes for interval tree
Jojy George Varghese [Wed, 12 Oct 2011 06:47:10 +0000 (23:47 -0700)]
makefile changes for interval tree

Added unit test case for interval tree to the makefile template.

Signed-off-by: Jojy George Varghese <jvarghese@scalecomputing.com>
13 years agomds: Unit tests for interval tree
Jojy George Varghese [Wed, 12 Oct 2011 06:26:13 +0000 (23:26 -0700)]
mds: Unit tests for interval tree

Provides usage scenarios and test cases for interval tree
implementation.

Tests include:
 - testing addInterval interface
 - testing removeInterval interfaces
 - testing with various template parameters

Signed-off-by: Jojy George Varghese <jvarghese@scalecomputing.com>
13 years agomds: Interval tree implementation
Jojy George Varghese [Wed, 12 Oct 2011 06:17:16 +0000 (23:17 -0700)]
mds: Interval tree implementation

Interval tree is an optimized data structure for representing and
querying intervals. Elementary intervals are represented as nodes of an
avl tree and the corresponding data is stored on these nodes based on a
concept of span. This representation allows log(n) (where n is the
number of data) storage. The balanced avl tree allows a log(n) query.
The implementation is a template class that is instantiated based on
two parameters:
 - Interval type
 - Data type

Signed-off-by: Jojy George Varghese <jvarghese@scalecomputing.com>
13 years agoauth: remove global instance of auth_supported
Sage Weil [Thu, 13 Oct 2011 20:28:41 +0000 (13:28 -0700)]
auth: remove global instance of auth_supported

Wrap it in a class.

Instantiate locally, or keep a copy around if we'll need it often.

Factor out the protocol selection into an AuthSupported method.  Prefer
larger ids, for lack of a better policy.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agoosd: bound generate_past_intervals() by oldest map
Sage Weil [Thu, 13 Oct 2011 16:53:41 +0000 (09:53 -0700)]
osd: bound generate_past_intervals() by oldest map

The oldest osdmap we maintain is a lower bound on last_epoch_clean for the
entire system (assuming the monitor is doing its job right).  We can stop
generating past intervals when we hit it.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
13 years agocls_rgw: remove the write_bucket_dir function.
Greg Farnum [Wed, 12 Oct 2011 00:02:20 +0000 (17:02 -0700)]
cls_rgw: remove the write_bucket_dir function.

It's no longer called anywhere. Hurray, we don't do our own
read-modify-write cycle any more (and can exploit the power of
btrees later)!

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
13 years agocls_rgw: rewrite rgw_bucket_complete_op to use update.
Greg Farnum [Wed, 12 Oct 2011 23:37:55 +0000 (16:37 -0700)]
cls_rgw: rewrite rgw_bucket_complete_op to use update.

Unfortunately we can't do multiple writes via the interface -- the
second one will clobber the first one. So use the update functionality
and go through that pain instead.

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
13 years agocls_rgw: refactor rgw_bucket_complete_op in terms of TMAP
Greg Farnum [Tue, 11 Oct 2011 23:58:10 +0000 (16:58 -0700)]
cls_rgw: refactor rgw_bucket_complete_op in terms of TMAP

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
13 years agocls_rgw: refactor rgw_bucket_prepare_op in terms of tmap
Greg Farnum [Tue, 11 Oct 2011 23:28:22 +0000 (16:28 -0700)]
cls_rgw: refactor rgw_bucket_prepare_op in terms of tmap

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
13 years agocls_rgw: refactor rgw_bucket_init_index in terms of tmap
Greg Farnum [Tue, 11 Oct 2011 22:51:08 +0000 (15:51 -0700)]
cls_rgw: refactor rgw_bucket_init_index in terms of tmap

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
13 years agocls_rgw: refactor read_bucket_dir in terms of tmap.
Greg Farnum [Tue, 11 Oct 2011 22:50:33 +0000 (15:50 -0700)]
cls_rgw: refactor read_bucket_dir in terms of tmap.

This function won't be called often once refactoring is done, but
its functionality will be needed for listing, if nothing else.

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>