Samuel Just [Tue, 6 Dec 2011 21:23:03 +0000 (13:23 -0800)]
ReplicatedPG: don't crash on empty data_subset in sub_op_push
If data_subset is empty (i.e., the data we pulled is no longer useful),
we should mark complete false and continue rather than fail the
assert in range_end().
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Greg Farnum [Tue, 6 Dec 2011 22:24:08 +0000 (14:24 -0800)]
ReplicatedPG: do not ->put() scrub messages when adding to a WorkQueue.
This function is passing a reference from PG::active_rep_scrub to
the req_scrub_wq, not eliminating the reference (and the WorkQueue
doesn't grab a new reference itself, either).
The other alternative is to convert the WorkQueue to grab a
reference, but since they can cycle through the WorkQueue more than
once, and need to be ->put() outside the WorkQueue, I don't like
that option.
This should fix #1758.
Also add an assert to PG::_request_scrub_map to check on the other
possible cause of this bug (and fix the indentation).
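As a rough illustration of the ownership rule described above (not the actual Ceph code), the sketch below hands a single reference from its holder to a queue that takes no reference of its own. `Msg`, `WorkQueue`, and `demo_handoff` are hypothetical names invented for this example; the point is only that calling put() at enqueue time would drop the sole reference while the queue still holds the pointer.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for a reference-counted message.
struct Msg {
  static int live;              // number of Msg objects currently allocated
  int nref;
  Msg() : nref(1) { ++live; }   // the creator holds the initial reference
  ~Msg() { --live; }
  void put() {                  // drop one reference; free on the last one
    assert(nref > 0);
    if (--nref == 0)
      delete this;
  }
};
int Msg::live = 0;

// Hypothetical queue that stores raw pointers and takes no reference itself.
struct WorkQueue {
  std::deque<Msg*> q;
  void queue(Msg *m) { q.push_back(m); }
  Msg *dequeue() {
    Msg *m = q.front();
    q.pop_front();
    return m;
  }
};

// Pass the holder's reference to the queue, then release it after dequeue.
int demo_handoff() {
  WorkQueue wq;
  Msg *active = new Msg;   // holder owns the only reference
  wq.queue(active);        // reference moves to the queue: no put() here
  active = nullptr;        // holder no longer owns it
  Msg *m = wq.dequeue();
  int refs_seen = m->nref; // count is unchanged by the handoff
  m->put();                // the single put() happens outside the queue
  return refs_seen;
}
```

Had `demo_handoff` called `put()` right after `wq.queue(active)`, the object would have been freed while its pointer was still queued.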
Tommi Virtanen [Tue, 6 Dec 2011 20:13:03 +0000 (12:13 -0800)]
doc: Reorganize pip calls to use a requirements file.
The conditional before running pip install was unnecessary,
"pip install" on already installed packages is fast (as long
as it's not --upgrade), and --quiet makes it not spam the
console.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Sage Weil [Mon, 5 Dec 2011 18:52:24 +0000 (10:52 -0800)]
filejournal: remove bogus check in read_entry
It is perfectly fine to read events that are older than the fs's seq from
the journal; open() will skip them when positioning the read pointer on
open.
Also, this code is nonsensical; it always failed the assertion.
Sage Weil [Mon, 5 Dec 2011 17:34:44 +0000 (09:34 -0800)]
filejournal: set last_committed_seq based on fs, not journal
last_committed_seq is the last seq committed to the fs, not the journal.
Set it when we begin replay with the fs provided value, not from the newest
entry in the journal.
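A minimal sketch of the replay positioning that the two filejournal messages above describe, assuming a journal modeled as an in-memory list of sequence numbers. `Entry`, `first_replay_index`, and `demo_journal` are hypothetical names, not FileJournal's interface: entries at or below the seq committed to the fs are skipped rather than treated as an error, and the starting point comes from the fs-provided value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical journal entry: only the sequence number matters here.
struct Entry {
  uint64_t seq;
};

// Return the index of the first entry that still needs replaying, i.e. the
// first entry whose seq is newer than what the fs has already committed.
// Older entries are legal to encounter and are simply skipped.
size_t first_replay_index(const std::vector<Entry> &journal,
                          uint64_t fs_committed_seq) {
  size_t i = 0;
  while (i < journal.size() && journal[i].seq <= fs_committed_seq)
    ++i;
  return i;
}

// Small fixed journal used for illustration: seqs 3, 4, 5, 6.
std::vector<Entry> demo_journal() {
  std::vector<Entry> j;
  for (uint64_t s = 3; s <= 6; ++s)
    j.push_back(Entry{s});
  return j;
}
```

With the demo journal and an fs-committed seq of 4, replay would start at the entry with seq 5.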
Sage Weil [Fri, 2 Dec 2011 23:35:38 +0000 (15:35 -0800)]
mon: stub perfcounters for monitor, cluster
The 'mon' perfcounter is for the local daemon and is always registered.
The 'cluster' perfcounter is for cluster state, and is only registered
(and thus only shows up via the admin socket) when the current daemon is
part of the cluster quorum.
This could conceivably cause the reply ordering mismatch seen in bug
#1490. Not sure why we didn't also fix this caller when we fixed that
bug last time :).
Sage Weil [Fri, 2 Dec 2011 17:58:45 +0000 (09:58 -0800)]
crush: ignore forcefed input that doesn't exist
This might happen if, e.g., the file_layout specifies an osd that later
is removed from the cluster entirely. Just ignore it instead of making
upper layers duplicate this check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
And I change my mind. I think this is most cleanly handled inside crush, so
we don't duplicate the same check that is generating the error with a different
data structure.
Mark Kampe [Thu, 1 Dec 2011 23:58:32 +0000 (15:58 -0800)]
Doc: delete gratuitous index.html
It was not an index, and seems to contain recommendations
for system configuration. I have renamed it to confusing.txt
and will merge it in a future commit.
Signed-off-by: Mark Kampe <mark.kampe@dreamhost.com>
which was a copy of PlanningImplementation.txt
(and not html at all).
Restored previous index.rst, which was overwritten with a copy
of PlanningImplementation.txt, but removed all of the recursively
included content from the document.
I will cherry-pick merge the new contents in a subsequent commit.
Signed-off-by: Mark Kampe <mark.kampe@dreamhost.com>
Sage Weil [Mon, 28 Nov 2011 00:10:46 +0000 (16:10 -0800)]
mon: search for local ip during mkfs
If an address isn't explicitly specified during mkfs, look for an unnamed
monitor in the (generated) monmap and see if any of those addresses is
configured on the local machine. If so, assume it's us, and name ourselves
in the seed monmap.
Samuel Just [Tue, 22 Nov 2011 17:30:35 +0000 (09:30 -0800)]
ReplicatedPG: Account for clone space usage in make_writeable
Previously, we accounted for clone space usage inconsistently in
write_update_size_and_usage etc when walking through the operations.
make_writeable may change the most recent clone overlap, however, so we
can't handle it until then.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 23 Nov 2011 15:02:41 +0000 (07:02 -0800)]
ceph: fix shutdown race
Shut down MonClient before messenger, to avoid race with MonClient::tick()
and MonClient::shutdown().
Fixes
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f44475e2849 in _L_lock_953 () from /lib/libpthread.so.0
#2 0x00007f44475e266b in __pthread_mutex_lock (mutex=0x14d8dc8) at pthread_mutex_lock.c:61
#3 0x00000000005ae090 in Mutex::Lock (this=0x14d8db8, no_lockdep=false) at ./common/Mutex.h:108
#4 0x000000000068440e in MonClient::shutdown (this=0x14d8c30) at mon/MonClient.cc:386
#5 0x00000000005b2653 in ceph_tool_common_shutdown (ctx=0x14d84c0) at tools/common.cc:661
#6 0x00000000005ada29 in main (argc=7, argv=0x7fff8a2394c8) at tools/ceph.cc:304
vs
#0 0x00007f44475e8a0b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1 0x00000000005eff6b in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2 0x00000000005f0165 in handle_fatal_signal (signum=11) at global/signal_handler.cc:106
#3 <signal handler called>
#4 0x0000000000000000 in ?? ()
#5 0x000000000068661a in MonClient::tick (this=0x14d8c30) at mon/MonClient.cc:621
#6 0x0000000000689e3b in MonClient::C_Tick::finish(int) ()
#7 0x000000000061b3c5 in SafeTimer::timer_thread (this=0x14d8df8) at common/Timer.cc:102
#8 0x000000000061c6f0 in SafeTimerThread::entry() ()
#9 0x00000000005f1219 in Thread::_entry_func (arg=0x14e1a00) at common/Thread.cc:41
#10 0x00007f44475e0971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#11 0x00007f4445ead92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
Tommi Virtanen [Wed, 23 Nov 2011 01:48:40 +0000 (17:48 -0800)]
common/pick_address: Fix IP address stringification.
Different sockaddr_* have the actual address (sin_addr, sin6_addr)
at different offsets, and sockaddr->sa_data just isn't enough.
inet_ntop conspires by taking a void*. I could figure out the right
offset with a switch (found->sa_family), but let's go for the
supposedly write-once-run-with-any-AF solution, getnameinfo.
Which, naturally, takes an extra length argument that is AF-specific,
and not provided anywhere nicely by getifaddrs. Huzzah!
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
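The approach described above can be sketched roughly as follows; this is an illustration under assumptions, not the code from the commit. `sockaddr_len`, `addr_to_string`, and the two `demo_*` functions are hypothetical helpers. The sketch derives the AF-specific length from `sa_family` (since getifaddrs does not supply it) and passes it to getnameinfo with `NI_NUMERICHOST`.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Hypothetical helper: the AF-specific length getnameinfo() expects,
// or 0 for address families this sketch does not handle.
socklen_t sockaddr_len(const struct sockaddr *sa) {
  switch (sa->sa_family) {
  case AF_INET:
    return sizeof(struct sockaddr_in);
  case AF_INET6:
    return sizeof(struct sockaddr_in6);
  default:
    return 0;
  }
}

// Hypothetical helper: numeric string form of an address, empty on failure.
std::string addr_to_string(const struct sockaddr *sa) {
  socklen_t len = sockaddr_len(sa);
  if (len == 0)
    return std::string();
  char buf[NI_MAXHOST];
  // NI_NUMERICHOST: format the address itself, do not do a DNS lookup.
  int r = getnameinfo(sa, len, buf, sizeof(buf), nullptr, 0, NI_NUMERICHOST);
  if (r != 0)
    return std::string();
  return std::string(buf);
}

// Build an IPv4 loopback sockaddr and stringify it.
std::string demo_ipv4_loopback() {
  struct sockaddr_in sin = {};
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  return addr_to_string(reinterpret_cast<const struct sockaddr *>(&sin));
}

// Build an IPv6 loopback sockaddr and stringify it.
std::string demo_ipv6_loopback() {
  struct sockaddr_in6 sin6 = {};
  sin6.sin6_family = AF_INET6;
  sin6.sin6_addr = in6addr_loopback;
  return addr_to_string(reinterpret_cast<const struct sockaddr *>(&sin6));
}
```

The same `addr_to_string` call works for both families because the family-specific part is confined to choosing the length.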
Sage Weil [Tue, 22 Nov 2011 18:09:41 +0000 (10:09 -0800)]
mon: mark down all connections when rank changes
The election and some other stuff depend on msg->get_source().num() to get
the peer rank, and that is part of the connection state. If it changes,
we need to close old connections and open new ones so that we aren't
taken for someone else (like mon.-1).
Samuel Just [Mon, 21 Nov 2011 23:06:35 +0000 (15:06 -0800)]
PG: it's not necessary to call build_inc_scrub_map in build_scrub_map
Because we have called osr.flush(), it's safe to tag map.valid_through
as last_update. We will still have to catch up once we have stopped
writes and allowed the filestore to catch up anyway.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>