Sage Weil [Thu, 24 Mar 2011 00:17:44 +0000 (17:17 -0700)]
mds: reimplement laggy
The goal is for the MDS to stop processing requests when it hasn't heard
from the monitors, to avoid a situation where a rogue process goes off
doing its own thing. Yes, if we fail it over, the cmds can't write to the
object store, but it can still reply to clients when it is not appropriate
to do so.
The old logic was fragile and wonky, with messages getting deferred, and
then re-deferred. This implementation is much cleaner and should be much
more efficient and less fragile. There are still improvements to be made
as far as which messages we do/do not process when we think we're laggy.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
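A minimal sketch of the idea described above, not the actual MDS code: a
gate that stops processing once the monitors have been silent for longer
than a grace period, and holds incoming messages on a single waiting list
instead of deferring and re-deferring them. The names LaggyGate,
heard_from_monitor, submit, and take_waiting, and the notion of a fixed
grace period, are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <string>
#include <utility>

// Hypothetical gate; illustrates the "laggy" idea only.
class LaggyGate {
public:
  using Clock = std::chrono::steady_clock;

  LaggyGate(Clock::time_point start, std::chrono::seconds grace)
    : last_heard_(start), grace_(grace) {}

  // Record that the monitors acknowledged us at time `now`.
  void heard_from_monitor(Clock::time_point now) { last_heard_ = now; }

  // We consider ourselves laggy once the silence exceeds the grace period.
  bool is_laggy(Clock::time_point now) const {
    return now - last_heard_ > grace_;
  }

  // Returns true if the message may be processed now; otherwise the
  // message is held exactly once on the waiting list.
  bool submit(std::string msg, Clock::time_point now) {
    if (is_laggy(now)) {
      waiting_.push_back(std::move(msg));
      return false;
    }
    return true;
  }

  // Hand back everything that was held, leaving the list empty.
  std::deque<std::string> take_waiting() {
    std::deque<std::string> out;
    out.swap(waiting_);
    return out;
  }

private:
  Clock::time_point last_heard_;
  std::chrono::seconds grace_;
  std::deque<std::string> waiting_;
};
```

A caller would drain take_waiting() once heard_from_monitor() has been
called again; which message types should bypass the gate is left open here,
as it is in the commit message.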
Sage Weil [Thu, 24 Mar 2011 03:37:04 +0000 (20:37 -0700)]
mds: skip redundant flush before journal segment trim
Back in olden times when we would wait for acks for some journal
writes, we did an extra wait_for_safe() before discarding a journal segment
to make sure anything being discarded was safely committed in newer
segments. These days mds_log_unsafe is always false (and
journaler_safe is true), so we can skip this check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 22 Mar 2011 04:38:36 +0000 (21:38 -0700)]
osd: factor pg get-or-create code into common helper
handle_pg_notify and _process_pg_info both look up or create a PG based
on an incoming message. Factor that code into a common helper. There
were a few differences in that the pg notify handler code deals with
more cases (namely, pg creation), but this is harmless for the more
general _process_pg_info caller.
Closes: #577
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
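A rough sketch of what a shared get-or-create helper can look like, under
stated assumptions: PG, PGRegistry, and the may_create flag are hypothetical
stand-ins, and the real helper handles far more state than an integer id.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>

// Hypothetical placement group; only the id is modelled.
struct PG {
  explicit PG(int id) : pgid(id) {}
  int pgid;
};

// Hypothetical registry shared by both message handlers.
class PGRegistry {
public:
  // Return the existing PG, or create it when the caller allows creation.
  // Returns nullptr if the PG is unknown and creation is not allowed.
  PG* get_or_create(int pgid, bool may_create) {
    auto it = pgs_.find(pgid);
    if (it != pgs_.end())
      return it->second.get();
    if (!may_create)
      return nullptr;
    std::unique_ptr<PG> pg(new PG(pgid));
    PG* raw = pg.get();
    pgs_[pgid] = std::move(pg);
    return raw;
  }

  std::size_t size() const { return pgs_.size(); }

private:
  std::map<int, std::unique_ptr<PG>> pgs_;
};
```

Both callers then go through one code path, so the extra creation handling
is simply unused by the caller that never needs it.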
Samuel Just [Tue, 22 Mar 2011 21:52:15 +0000 (14:52 -0700)]
FileStore: replace op_queue_throttle with op_queue_reserve_throttle
Previously, queue_op would call op_queue_throttle while holding the
journal_lock. op_queue_throttle, however, can sleep.
We fix the problem as follows:
1) Factor build_op out of queue_op
2) op_queue_throttle is now op_queue_reserve_throttle and takes an op as
an argument. op_queue_reserve_throttle can be called before the journal
lock is taken. This also avoids the race between calling throttle and
incrementing op_queue_bytes and op_queue_len.
3) queue_op now takes the op generated using build_op as an argument.
4) _journaled_ahead no longer needs to call throttle as
queue_transactions has already reserved space.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
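The reserve-style throttle can be sketched as follows. This is an
illustration, not the FileStore code: Op, OpQueueThrottle, reserve, and
release are hypothetical names, and the rule that an empty queue always
admits one op is an assumption added so that an oversized op cannot block
forever. The point is that the wait and the accounting happen under the
throttle's own lock, so the caller can reserve before taking any other lock
and there is no window between "throttle" and "increment".

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a queued operation; only its size matters here.
struct Op {
  uint64_t bytes;
};

// Hypothetical throttle that reserves queue space for a specific op.
class OpQueueThrottle {
public:
  OpQueueThrottle(uint64_t max_ops, uint64_t max_bytes)
    : max_ops_(max_ops), max_bytes_(max_bytes) {}

  // May sleep. Call this before taking any lock that must not be held
  // across a sleep. The check and the accounting share one critical section.
  void reserve(const Op& op) {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [&] {
      // Assumption: an empty queue always admits one op.
      return ops_ == 0 ||
             (ops_ < max_ops_ && bytes_ + op.bytes <= max_bytes_);
    });
    ++ops_;
    bytes_ += op.bytes;
  }

  // Give the reserved space back once the op has been applied.
  void release(const Op& op) {
    {
      std::lock_guard<std::mutex> l(lock_);
      --ops_;
      bytes_ -= op.bytes;
    }
    cond_.notify_all();
  }

  uint64_t ops() const {
    std::lock_guard<std::mutex> l(lock_);
    return ops_;
  }

  uint64_t bytes() const {
    std::lock_guard<std::mutex> l(lock_);
    return bytes_;
  }

private:
  mutable std::mutex lock_;
  std::condition_variable cond_;
  uint64_t max_ops_;
  uint64_t max_bytes_;
  uint64_t ops_ = 0;
  uint64_t bytes_ = 0;
};
```

With this shape, a later stage that runs after the reservation has no need
to throttle again, because the space is already accounted for.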
Greg Farnum [Fri, 18 Mar 2011 21:25:45 +0000 (14:25 -0700)]
MDCache: make linkunlink rstat propagation work properly.
We could be in a lock state (i.e., gather) where we can't take new locks.
But if we're in this function for linkunlink we have to already have
a lock, so in that case make sure the function succeeds and assert
that we do have a lock.
BuildRequires: cryptopp-devel has been replaced by nss-devel. Skip
google-perftools-devel because that package is not available for x86-64.
Add python.
Don't install libcls_rbd.so.1.0.0.debug.
Package crbdnamer and librados-config.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
mkostemps isn't present in older glibc versions, like the ones in CentOS
5.5. We don't really use any of the extra functionality of mkostemps in
this test.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Tue, 15 Mar 2011 00:25:46 +0000 (17:25 -0700)]
PG,OSD: activate pg during replay
Replay PGs already accept and queue transactions. PGs will now go to
active during replay in order to simplify the state reported to the user
and to allow recovery to begin.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Tommi Virtanen [Mon, 14 Mar 2011 18:52:44 +0000 (11:52 -0700)]
blobhash: Avoid size_t in templatized hash functions.
On S/390, the earlier rjhash<size_t> failed with
"no match for call to '(rjhash<long unsigned int>) (size_t&)'".
It seems the rjhash<size_t> logic was only enabled
on some architectures, and relied on some pretty deep
internals of the bit layout (__LP64__).
Use an explicitly 32-bit type as early as possible, and
convert back to size_t only when really needed. This
should work, and simplifies the code. In theory, we might
have a narrower output (size_t might be 64-bit, max value
we now output is 32-bit), but this doesn't matter: the hash
is only ever used for picking a slot in an in-memory hash
table (hash(key) modulo num_of_buckets), and there won't be
>4G buckets.
Closes: #837
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
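To illustrate the "32-bit first, size_t only at the end" approach, here is a
small sketch. It uses 32-bit FNV-1a as a stand-in hash, not the rjhash from
the tree, and the names hash32 and pick_bucket are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in 32-bit hash (FNV-1a), NOT the rjhash used by blobhash.
// The result type is explicitly 32-bit on every architecture.
inline uint32_t hash32(const char* data, std::size_t len) {
  uint32_t h = 2166136261u;                       // FNV offset basis
  for (std::size_t i = 0; i < len; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 16777619u;                               // FNV prime
  }
  return h;
}

// Widen to size_t only at the point where a bucket index is needed.
inline std::size_t pick_bucket(const char* key, std::size_t len,
                               std::size_t num_buckets) {
  assert(num_buckets > 0);
  return static_cast<std::size_t>(hash32(key, len)) % num_buckets;
}
```

Because the hash never depends on the width of size_t, the same code
compiles unchanged on 32-bit and 64-bit targets; the only cost is that the
index range is capped at 32 bits, which is harmless for an in-memory table.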
Sage Weil [Thu, 17 Mar 2011 18:32:56 +0000 (11:32 -0700)]
msgr: let user explicitly set nonce
There will be problems if two messengers use the same entity_addr_t because
they are on the same ip and choose the same nonce (e.g., because they are
in the same process). Let the caller sort this out in whatever way it
finds most appropriate.
For libceph, librados, and csyn, add N million to the pid.
Fixes: #877
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
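One way a caller could derive a nonce along these lines is sketched below.
The ClientKind enum, the specific multipliers 1, 2, and 3, and the
make_nonce name are all hypothetical; the log only says that some multiple
of a million is added to the pid. The sketch also assumes pids stay below
one million, otherwise the ranges overlap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-library multipliers; the real values of N are not
// given in the commit message.
enum class ClientKind : uint64_t { libceph = 1, librados = 2, csyn = 3 };

// Offset the pid by N million so that different kinds of messenger in the
// same process (same ip, same pid) end up with different nonces.
// Assumption: pid < 1000000.
inline uint64_t make_nonce(uint64_t pid, ClientKind kind) {
  return pid + static_cast<uint64_t>(kind) * 1000000ull;
}
```

Two messengers of the same kind in one process would still collide under
this scheme, which is why the choice is left to the caller.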
Sage Weil [Wed, 16 Mar 2011 21:39:24 +0000 (14:39 -0700)]
common: disable log_per_instance for non-daemons
Turn off the logging and symlink rotation, not just symlink rotation.
This is a somewhat arbitrary distinction (log per instance only for
daemons), but it's only used by vstart and only really useful for
development/debugging, so who cares.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>