Samuel Just [Wed, 30 Mar 2011 20:14:55 +0000 (13:14 -0700)]
mkcephfs: copy to daemon nodes for each daemon
The tmp directory is removed after each daemon. Previously, this would
break if two daemons were on the same node. Now, the files will be
copied for each daemon.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 30 Mar 2011 23:46:04 +0000 (16:46 -0700)]
journaler: don't block when we adjust back write_pos
is_readable() may need to adjust the write_pos backward, but will return
false. If we are at the end we still need to wake up any waiters so they
know about it.
md_config_t::parse_argv: fold md_config_t::parse_argv_part2 into
parse_argv. Fix brokenness introduced by the std::string switchover.
OPTION macro: move single-character options out of the OPTION macro and
into config.cc
Fix ceph_argparse_witharg / ceph_argparse_flag uses to include a
trailing (char*)NULL, to ensure that we terminate with a pointer rather
than a 32-bit int.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Use std::string to represent md_config_t strings. This makes memory
management a lot easier and should fix some leaks. "No value" is now
represented by an empty string, whereas before some places were using
empty strings and some were using NULL.
config.cc: Fix a minor decode bug.
In pid_file.cc, copy the pid_file using snprintf, since strncpy
does not always NULL-terminate.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Move parsing into config.cc, since there was already parsing code there.
Move metavariable escaping out of ConfUtils; having this in ConfUtils
makes it impossible to de-globalize g_conf.
Create a nicer API for pulling stuff out of the configuration file.
Since the value we pull is determined by the config structure in effect
at the time, it should be an instance method of md_config_t.
Remove some deadcode. Add some comments.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Tommi Virtanen [Tue, 29 Mar 2011 16:21:09 +0000 (09:21 -0700)]
common: Make armor.h safe to use from C.
mount.ceph needs to base64-decode the secrets, so we can get rid of
the kernel-side base64 decode, but it doesn't need all of common lib.
And it is written in C.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Tommi Virtanen [Tue, 29 Mar 2011 00:32:24 +0000 (17:32 -0700)]
mount.ceph: Modprobe ceph before trying the mount.
This will be needed for the next few commits, where we try to load the
keys into the kernel; without ceph.ko loaded, the key type will not be
recognized.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Sage Weil [Tue, 29 Mar 2011 18:58:13 +0000 (11:58 -0700)]
cmon: add --inject-monmap option
This lets you manually inject a monmap into a down monitor. This is useful
in cases where you need to change the monmap but aren't able to get a
quorum with the old map.
Tommi Virtanen [Mon, 28 Mar 2011 22:45:45 +0000 (15:45 -0700)]
vstart.sh: Filter out IPv6 and localhost IP addresses.
On e.g. Ubuntu 10.10, hostname --ip-address outputs something
like "::1 10.1.2.3 127.0.1.1", which makes the generated
config invalid. Get rid of the entries we can't use.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
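The filtering rule described above can be sketched as follows. The actual change is in a shell script; this expresses the same rule in C++ purely for illustration. The function name usable_addrs and the exact tests (a ':' marks IPv6, a "127." prefix marks loopback) are assumptions, not the logic taken from vstart.sh.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split whitespace-separated addresses and keep
// only those that are neither IPv6 (contain ':') nor IPv4 loopback
// (start with "127.").
std::vector<std::string> usable_addrs(const std::string &hostname_output)
{
  std::vector<std::string> out;
  std::istringstream in(hostname_output);
  std::string addr;
  while (in >> addr) {
    if (addr.find(':') != std::string::npos)
      continue;  // IPv6, e.g. "::1"
    if (addr.compare(0, 4, "127.") == 0)
      continue;  // loopback, e.g. "127.0.1.1"
    out.push_back(addr);
  }
  return out;
}
```

With the sample output quoted in the commit message, only 10.1.2.3 is kept.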
Sage Weil [Fri, 25 Mar 2011 21:34:17 +0000 (14:34 -0700)]
mds: include .ceph in root directory
If the dentry isn't marked dirty _commit_partial won't save it. This is
caught later by the check_rstats() (or anyone actually trying to use the
/.ceph directory).
Fixes: #938
Signed-off-by: Sage Weil <sage@newdream.net>
Since NULL is really just a macro defined to be 0, we must use
(char*)NULL or similar to force the compiler to use a true pointer value
as the last argument to the run_cmd varargs function. Otherwise, the 0
gets promoted to an int, which probably is not the same length as a
pointer these days (32 vs. 64.)
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
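The sentinel issue above can be sketched as follows. count_args is a hypothetical stand-in for a run_cmd-style varargs function, not the real one; it walks its char* arguments until it reads a null pointer. The callee fetches each argument as a pointer, so the caller must pass a pointer-sized null, which the explicit cast guarantees.

```cpp
#include <cstdarg>
#include <cstddef>

// Hypothetical stand-in for a run_cmd-style function: counts the
// char* arguments that precede the terminating null pointer.
int count_args(const char *first, ...)
{
  int n = 0;
  va_list ap;
  va_start(ap, first);
  // Each va_arg reads a full pointer; the caller must therefore end
  // the list with a pointer-typed null such as (char *)NULL.
  for (const char *p = first; p != NULL; p = va_arg(ap, char *))
    ++n;
  va_end(ap);
  return n;
}
```

A call such as count_args("mount", "-t", "ceph", (char *)NULL) terminates the list with a true pointer; passing a bare integer 0 in that position would leave the callee reading a pointer where only an int was supplied.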
Sage Weil [Fri, 25 Mar 2011 20:50:45 +0000 (13:50 -0700)]
mds: fix client session removal on journal replay
We want to remove the client session from the map as long as it is not
attached to an actual messenger Connection. This key point got lost
somewhere the last time the session states were restructured. It is now
explicit.
This fixes the symptom where a recovering MDS reconnect has to time out on
clients that cleanly closed their sessions.
Also, fix a use-after-free when (uselessly) printing the session state.
Sage Weil [Fri, 25 Mar 2011 19:37:48 +0000 (12:37 -0700)]
journaler: remove ack/safe distinction
Rip out old complexity to _only_ pay attention to when data is safely
committed on disk. No more ack/safe distinction or ack_barrier complexity
(to preserve ordering with some submissions waiting on ack and some on safe).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 25 Mar 2011 16:51:53 +0000 (09:51 -0700)]
journaler: issue separate reads per period
This lets us potentially digest any read data as soon as possible. Before,
the Filer would issue a string of reads and we'd only get the data back
once _all_ those objects were read.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 25 Mar 2011 16:30:16 +0000 (09:30 -0700)]
journaler: make readahead/prefetch smarter
Always try to prefetch N segments ahead of the current read position. The
old implementation would read a bunch of data, process it all, then read
a bunch more. This was suboptimal on a couple different levels.
Also, make an internal _is_readable() _not_ do the prefetch step; only do
that for external callers.
Fixes: #929
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Mar 2011 03:58:03 +0000 (20:58 -0700)]
mds: remove mds_log_unsafe mode
The mds_log_unsafe mode would wait for ack for some journal writes, and
safe for others. Now that we can reply to client requests without waiting
for the journal to flush (as of ~2 years ago), this distinction is no
longer useful. It is also more error-prone, as it complicates the code
and vastly expands the possible combinations of MDS failures and replay
scenarios we need to verify.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Mar 2011 00:17:44 +0000 (17:17 -0700)]
mds: reimplement laggy
The goal is for the MDS to stop processing requests when it hasn't heard
from the monitors, to avoid a situation where a rogue process goes off
doing its own thing. Yes, if we fail it over, the cmds can't write to the
object store, but it can reply to clients when it may not be appropriate
or good to do so.
The old logic was fragile and wonky, with messages getting deferred, and
then re-deferred. This implementation is much cleaner and should be much
more efficient and less fragile. There are still improvements to be made
as far as which messages we do/do not process when we think we're laggy.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Mar 2011 03:37:04 +0000 (20:37 -0700)]
mds: skip redundant flush before journal segment trim
Back in olden times when we would wait for acks for some journal
writes, we did an extra wait_for_safe() before discarding a journal segment
to make sure anything being discarded was safely committed in newer
segments. These days mds_log_unsafe is always false (and
journaler_safe is true), so we can skip this check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 22 Mar 2011 04:38:36 +0000 (21:38 -0700)]
osd: factor pg get-or-create code into common helper
handle_pg_notify and _process_pg_info both lookup or create a PG based
on an incoming message. Factor that code into a common helper. There
were a few differences in that the pg notify handler code deals with
more cases (namely, pg creation), but this is harmless for the more
general _process_pg_info caller.
Closes: #577
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Tue, 22 Mar 2011 21:52:15 +0000 (14:52 -0700)]
FileStore: replace op_queue_throttle with op_queue_reserve_throttle
Previously, queue_op would call op_queue_throttle while holding the
journal_lock. op_queue_throttle, however, can sleep.
We fix the problem as follows:
1) build_op is factored out of queue_op.
2) op_queue_throttle is now op_queue_reserve_throttle and takes an op as
an argument. op_queue_reserve_throttle can be called before the journal
lock is taken. This also avoids the race between calling throttle and
incrementing op_queue_bytes and op_queue_len.
3) queue_op now takes the op generated using build_op as an argument.
4) _journaled_ahead no longer needs to call throttle, as
queue_transactions has already reserved space.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
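The ordering described above (reserve queue space first, take the journal lock afterwards) can be sketched as follows. This is a simplified illustration under stated assumptions, not the FileStore code: the class OpQueueThrottle, the function submit, and example_journal_lock are hypothetical names, and real ops carry more than a byte count.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical, simplified throttle. reserve() may sleep, so it must be
// called before any other lock (such as a journal lock) is taken.
class OpQueueThrottle {
public:
  explicit OpQueueThrottle(uint64_t max_bytes) : max_bytes_(max_bytes) {}

  // Wait until the op fits, then account for it under the same lock, so
  // there is no window between the limit check and the counter update.
  // An op is always admitted into an empty queue, even if oversized.
  void reserve(uint64_t op_bytes) {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [&] {
      return bytes_ == 0 || bytes_ + op_bytes <= max_bytes_;
    });
    bytes_ += op_bytes;
    ++len_;
  }

  // Give back the space once the op has been applied.
  void release(uint64_t op_bytes) {
    {
      std::lock_guard<std::mutex> l(lock_);
      bytes_ -= op_bytes;
      --len_;
    }
    cond_.notify_all();
  }

  uint64_t bytes() const {
    std::lock_guard<std::mutex> l(lock_);
    return bytes_;
  }
  uint64_t len() const {
    std::lock_guard<std::mutex> l(lock_);
    return len_;
  }

private:
  mutable std::mutex lock_;
  std::condition_variable cond_;
  uint64_t max_bytes_;
  uint64_t bytes_ = 0;
  uint64_t len_ = 0;
};

static std::mutex example_journal_lock;  // hypothetical journal lock

// Reserve first (may block), then do the short critical section.
void submit(OpQueueThrottle &throttle, uint64_t op_bytes) {
  throttle.reserve(op_bytes);  // no other lock held while sleeping
  std::lock_guard<std::mutex> l(example_journal_lock);
  // ... queue the already-accounted op under the journal lock ...
}
```

Because the space is reserved before the journal lock is taken, nothing sleeps while holding that lock, and later stages do not need to throttle again.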