Samuel Just [Tue, 20 Sep 2011 20:34:31 +0000 (13:34 -0700)]
OSD: return NULL when the OSD does not have the pg in lookup_lock_raw_pg
Previously, we returned NULL if the osd lacked the pool, but not if the
osd had the pool and lacked the pg. In that case, the assert in
_lookup_lock_pg would crash the osd.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
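A minimal sketch of the lookup behaviour described above, assuming simplified stand-in types. ToyOSD, PG, and lookup_raw_pg here are hypothetical and are not the real OSD code; the point is only that both failure cases return a null pointer instead of reaching an assert.

```cpp
#include <map>
#include <utility>

// Hypothetical stand-ins for the OSD's pool and PG tables; not Ceph's real types.
struct PG { int id; };
using PoolId = int;
using PgId = std::pair<PoolId, int>;   // (pool, pg number within the pool)

struct ToyOSD {
  std::map<PoolId, bool> pools;   // pools this OSD knows about
  std::map<PgId, PG> pg_map;      // PGs this OSD actually holds

  // Returns nullptr both when the pool is unknown and when the pool is
  // known but the PG is absent, so callers can test the result.
  PG* lookup_raw_pg(const PgId& pgid) {
    if (pools.find(pgid.first) == pools.end())
      return nullptr;             // osd lacks the pool
    auto it = pg_map.find(pgid);
    if (it == pg_map.end())
      return nullptr;             // osd has the pool but lacks the pg
    return &it->second;
  }
};
```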
Sage Weil [Tue, 20 Sep 2011 17:58:20 +0000 (10:58 -0700)]
osd: remove throttle_op_queue()
There are subtle annoying problems with throttling and requeueing, and
throttling at this particular point in the stack makes little sense
anyway. We have
- messenger queue. throttled based on total bytes/payload
- op_queue, throttled before we queue items.
There is no real value in throttling a message before checking whether it
is valid (sent to the right osd, etc.) or putting it on the op_queue,
where it will sit until a worker thread picks it up and processes it.
When we get an osd_map, for instance, we pause op_queue, requeue
everything on the op_queue for reprocessing, and do the map update, so
not having a load of messages on that queue doesn't hurt us. It just
complicates requeueing in the throttle_op_queue case, and delays the
checks for non-existent PGs or misdirected requests.
Sage Weil [Tue, 20 Sep 2011 01:23:10 +0000 (18:23 -0700)]
osd: preserve ordering when throttling races with missing/degraded requeue
When we delay an op because the op_queue is full, we can violate the op
order:
- op1 comes in, waits because object is missing
- op2 comes in, throttles on op queue
- op1 is requeued (no longer missing)
- queue drains, op2 happens
- op1 happens
To avoid this, if we delay, requeue ourselves... after whatever else is
on the queue.
Fixes: #1490
Signed-off-by: Sage Weil <sage@newdream.net>
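The ordering problem can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the process function and its parameters are hypothetical, and the real op_queue is not modelled. With requeue_delayed set, the throttled op goes behind whatever is already queued.

```cpp
#include <deque>
#include <string>
#include <vector>

// Toy model: returns the order in which ops are processed.
// 'queue' holds ops that were already requeued (e.g. op1, once its object
// is no longer missing); 'delayed' is an op held back by throttling (op2).
std::vector<std::string> process(std::deque<std::string> queue,
                                 const std::string& delayed,
                                 bool requeue_delayed) {
  std::vector<std::string> order;
  if (requeue_delayed) {
    queue.push_back(delayed);   // fixed behaviour: wait behind queued ops
  } else {
    order.push_back(delayed);   // old behaviour: runs ahead of queued ops
  }
  while (!queue.empty()) {
    order.push_back(queue.front());
    queue.pop_front();
  }
  return order;
}
```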
Sage Weil [Mon, 19 Sep 2011 23:50:48 +0000 (16:50 -0700)]
osd: set reply version for dup requests
If we get a dup request, set the version in the reply. That means the
client knows the request was successful and committed, and knows the
version. It doesn't get anything else (e.g., data payload resulting from
mutations).
Tommi Virtanen [Mon, 19 Sep 2011 16:47:11 +0000 (09:47 -0700)]
ceph_common.sh: Do not sudo to root unless needed
Using do_root_cmd() doesn't really need to sudo to root
if you're already root.
Commit 71dc75bdafe62a098c0493ad62f2d0d2a6ca7946 causes a regression:
when system "foo" has a sudoers config that requires a tty,
init-ceph now fails like this:
sudo: sorry, you must have a tty to run sudo
when it is invoked by root with something like this:
ssh foo /etc/init.d/init-ceph start
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Several functions examine argv in order to set options. Only the last
argument parsing pass should remove the '--' from the argument vector.
If it is removed earlier than that, entries may be parsed as options,
when that was not the user's intent.
This change fixes the common argument parsing loops so that they do not
remove the double dash. It also rearranges some programs so that the
user's argument parsing loop comes last, rather than coming before the
common argument parsing loops.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
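A rough sketch of the two-pass idea, assuming a made-up "--debug" option. parse_common and parse_user are hypothetical names, not the actual Ceph helpers. The common pass stops at '--' and leaves it in place; only the final pass drops it.

```cpp
#include <string>
#include <vector>

// Hypothetical common pass: consumes a made-up "--debug" flag, stops at
// "--", and leaves the "--" itself in the vector for later passes.
bool parse_common(std::vector<std::string>& args) {
  bool debug = false;
  for (auto it = args.begin(); it != args.end(); ) {
    if (*it == "--") break;       // do not remove; the last pass needs it
    if (*it == "--debug") {
      debug = true;
      it = args.erase(it);
    } else {
      ++it;
    }
  }
  return debug;
}

// Hypothetical final (user) pass: the only place where "--" is dropped.
// Everything after the first "--" is treated as a positional argument.
std::vector<std::string> parse_user(const std::vector<std::string>& args) {
  std::vector<std::string> positional;
  bool options_done = false;
  for (const auto& a : args) {
    if (!options_done && a == "--") {
      options_done = true;
      continue;
    }
    positional.push_back(a);
  }
  return positional;
}
```

With the arguments "--debug -- --debug file", the common pass consumes only the first "--debug"; the second one survives as a positional argument because the '--' was still there to protect it.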
Sage Weil [Tue, 13 Sep 2011 04:23:00 +0000 (21:23 -0700)]
monclient: reopen session on monmap change
If our cur_mon is removed from the monmap, reopen the session. Do not
call _pick_new_mon() directly or we won't reset state, won't
reauthenticate, etc.
Instead of having global CompatSet objects, just have functions that can
return appropriate CompatSet objects. This avoids global constructor
and destructor ordering issues.
Fixes bug #1512
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
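The pattern can be sketched as follows. ToyCompatSet, get_supported_compat_set, and the "base" feature name are hypothetical placeholders rather than the real CompatSet interface. Because the object is built inside a function and returned by value, nothing depends on the order in which globals are constructed or destroyed.

```cpp
#include <set>
#include <string>

// Hypothetical simplified stand-in for CompatSet.
struct ToyCompatSet {
  std::set<std::string> incompat;
};

// Built on demand and returned by value, so no global object exists whose
// constructor or destructor could run in the wrong order.
ToyCompatSet get_supported_compat_set() {
  ToyCompatSet cs;
  cs.incompat.insert("base");   // placeholder feature name
  return cs;
}
```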
Sage Weil [Mon, 12 Sep 2011 02:19:32 +0000 (19:19 -0700)]
librbd: implement rbd buffered write window
Normal disks have a write cache and acknowledge writes before they reach
the platter. Among other things, this masks write latency. A flush
operation is needed when the user really cares that the writes are stable.
Implement a librbd write window that allows a window including the most
recent N bytes of writes to be immediately acked. A flush operation
blocks while they are pushed out to disk.
This differs from the typical disk in that writes are always immediately
sent to the backend store, while disks will buffer small writes for a time
(and, in fact, can be made to hold small writes in the cache indefinitely
under certain workloads).
Thus, 'rbd_writeback_window' may be a bit of a misnomer...
Currently this applies only to aio writes, not sync writes. That could
most easily be fixed by reimplementing write in terms of aio_write.
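A toy, single-threaded model of the window, for illustration only. ToyWriteWindow and its policy are assumptions, not librbd's implementation: every write is sent to the backend right away, and it is acked early only while the unstable bytes fit in the window.

```cpp
#include <algorithm>
#include <cstddef>

// Toy single-threaded model; names and policy are assumptions, not librbd's.
class ToyWriteWindow {
  std::size_t window_;          // max unstable bytes that may be acked early
  std::size_t in_flight_ = 0;   // bytes sent to the backend, not yet stable
public:
  explicit ToyWriteWindow(std::size_t window) : window_(window) {}

  // The write is always handed to the backend immediately. Returns true
  // if it can also be acked immediately, i.e. it fits in the window.
  bool aio_write(std::size_t len) {
    in_flight_ += len;
    return in_flight_ <= window_;
  }

  // Called when the backend reports that 'len' bytes are stable.
  void on_backend_commit(std::size_t len) {
    in_flight_ -= std::min(len, in_flight_);
  }

  // A flush has to wait while any written bytes are still unstable.
  bool flush_would_block() const { return in_flight_ > 0; }
};
```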
Sage Weil [Mon, 12 Sep 2011 01:58:42 +0000 (18:58 -0700)]
client: fix odd crash on rename
If the old_dentry is in the same dir, and it is the last dentry, we need
to keep the dir open.
This is hard to hit because the rename itself will typically instantiate
a null dentry on the target, and it's hard to construct a workload where
a racing process makes us drop it. Fortunately this was triggered
reliably by the snaptest-git-ceph.sh workunit.
Fixes: #1519
Signed-off-by: Sage Weil <sage@newdream.net>
Samuel Just [Sat, 10 Sep 2011 04:39:27 +0000 (21:39 -0700)]
PG: generate backlog when confronted with corrupt log
Currently we throw out the log and start up anyway. With this change, we
would throw out the log, generate a fresh backlog, and then start up.
That may not be the best possible thing, but it's better than what we
currently do. Indirectly fixes #1502.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Tommi Virtanen [Fri, 9 Sep 2011 23:25:14 +0000 (16:25 -0700)]
man: Generate manpages from doc/man.
Keeping the generated files in version control lets us
support builds from scratch without requiring the full
documentation toolchain to be installed.
The files were just copied over from build-doc/output/man,
after a ./admin/build-doc call. When redoing this, also
take care to remove any roff output if a file was removed
from doc/man, and update Makefile.am.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
We were previously setting up a reference loop. But the only way
to get Sessions is via the Connection, so let's just give Sessions
the pointer, and give Connections a counted ref.
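One reading of this, sketched with standard smart pointers: the Session holds a plain, non-owning pointer to its Connection, and the Connection holds the counted reference to the Session. ToySession and ToyConnection are hypothetical; the real classes use their own reference counting.

```cpp
#include <memory>

struct ToyConnection;

// The session only points back at its connection; it does not own it,
// so no ownership cycle is formed.
struct ToySession {
  ToyConnection* con = nullptr;
};

// The connection holds the single counted reference to the session.
struct ToyConnection {
  std::shared_ptr<ToySession> session;
};
```

When the connection goes away, its counted reference is dropped and the session is freed with it.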
We can't do that if we're trying to be Valgrind-clean, so just
make the lock name part of the class.
As best I can tell, that ordered initialization is safe because
data members are initialized in the order they are declared. See e.g.
http://xenon.arcticus.com/c-morsels-initializer-list-execution-order
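A small illustration of the initialization-order rule, with hypothetical ToyLock and ToyOwner classes. Since lock_name_ is declared before lock_, it is fully constructed by the time lock_ is initialized from it.

```cpp
#include <string>

// Hypothetical lock type that copies its name at construction time.
struct ToyLock {
  std::string name;
  explicit ToyLock(const std::string& n) : name(n) {}
};

class ToyOwner {
  std::string lock_name_;   // declared first, so initialized first
  ToyLock lock_;            // declared second, may safely read lock_name_
public:
  explicit ToyOwner(int id)
    : lock_name_("owner-" + std::to_string(id)),
      lock_(lock_name_) {}

  const std::string& lock_name() const { return lock_.name; }
};
```

Swapping the two member declarations would make lock_ read an uninitialized string, regardless of the order written in the initializer list.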
Sage Weil [Wed, 7 Sep 2011 20:28:21 +0000 (13:28 -0700)]
osd: take ondisk_read_lock on src_oids
We need to take the ondisk read lock on src oids for multiobject operations
(like clonerange) to ensure that written data has hit disk before we
clone it elsewhere.
Order of acquisition doesn't actually matter here, since the ondisk locks
are all leaves in the lock dependency hierarchy.