Sage Weil [Thu, 17 Nov 2011 19:56:37 +0000 (11:56 -0800)]
objecter: set skipped_map if we skip a map
This ensures that we resend _all_ requests, since we aren't sure which
may have mapped to a different primary and then back. This was missed in
the original implementation in 4fe9cca5dd63a1924be2b5cb18f542fb4b97a768.
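The reasoning can be sketched as follows. This is a minimal illustration, not the Objecter code: the helper names and the epoch arithmetic are assumptions. If at least one epoch between the map we hold and the map we received was never seen, the per-request "did the primary change" check cannot be trusted, so everything is resent.

```cpp
#include <cassert>

// Hypothetical helper: true if one or more epochs between the map we
// hold and the map we just received were never seen.
bool skipped_epochs(unsigned have_epoch, unsigned got_epoch) {
  return got_epoch > have_epoch + 1;
}

// Hypothetical helper: a request is resent if its primary visibly
// changed, or if a map was skipped, since the primary may have changed
// and then changed back in an epoch we never observed.
bool should_resend(bool skipped_map, bool primary_changed) {
  return skipped_map || primary_changed;
}
```

With have_epoch=10 and got_epoch=12, epoch 11 was skipped, so should_resend() is true even for requests whose primary looks unchanged.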
Sage Weil [Thu, 17 Nov 2011 19:39:36 +0000 (11:39 -0800)]
objecter: send slow osd MPing via Connection*
This may address #1732 indirectly because we have a Connection* reference
here. However, it's still not clear how we ended up with an OSDSession*
for an osd that doesn't exist. :/
Sage Weil [Wed, 16 Nov 2011 18:54:59 +0000 (10:54 -0800)]
mon: always load stashed version when version doesn't match
The slurp process can happen after the monitor has started and has some
in-memory version of the state, and that process may wipe out old
incrementals and change the stashed version. That means that in
update_from_paxos, we need to pull the stashed version if it doesn't
match what we currently have or else we may not have the incrementals we
need to get up to date.
This simplifies and cleans up that code a bit so it is not specific to
monitor startup.
Josh Pieper [Fri, 11 Nov 2011 13:19:55 +0000 (08:19 -0500)]
rgw: Fix some merge problems uncovered by gcc warnings:
* a refactor in e2100bce left the mod_ptr and unmod_ptr members set
incorrectly in RGWCopyObj::init_common
* a fix in 6752babd aggregated error returns, but then failed to do
anything with them
Signed-off-by: Josh Pieper <jjp@pobox.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Josh Pieper [Fri, 11 Nov 2011 13:19:02 +0000 (08:19 -0500)]
Resolve gcc warnings.
These should have no functional changes:
* Check errors from functions that currently cannot return any
* Initialize variables that gcc can't determine will be initialized
in a following function call
* Remove unused variables
Signed-off-by: Josh Pieper <jjp@pobox.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Sat, 12 Nov 2011 05:03:09 +0000 (21:03 -0800)]
osd: fix warnings
osd/ReplicatedPG.cc: In member function 'virtual void ReplicatedPG::remove_watchers_and_notifies()':
osd/ReplicatedPG.cc:1167: warning: suggest a space before ';' or explicit braces around empty body in 'for' statement
osd/ReplicatedPG.cc:1176: warning: suggest a space before ';' or explicit braces around empty body in 'for' statement
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 11 Nov 2011 22:52:14 +0000 (14:52 -0800)]
mon: allow monitor to automagically join cluster
If a monitor starts up with the correct fsid and auth keys, it will now
add itself to the monmap (and subsequently try to join the quorum) if it
is not already in the monmap.
Sage Weil [Fri, 11 Nov 2011 20:02:52 +0000 (12:02 -0800)]
mon: properly process monmaps even when I have the latest
We may get the latest monmap when we are doing our probing, but we still
need to process it in update_from_paxos(). Consider get_latest_version()
in addition to the active map.
Sage Weil [Thu, 10 Nov 2011 21:35:57 +0000 (13:35 -0800)]
test_filestore_idempotent: detect commit cycles due to non-idempotent ops
If we do a non-idempotent op and it does a commit itself, we never see
fs->is_committed() return true. Also count full commit cycles, and kill
ourselves after several of those have gone by.
Sage Weil [Thu, 10 Nov 2011 21:18:51 +0000 (13:18 -0800)]
filejournal: fix replay of non-idempotent ops
- start sync thread prior to replay, so that we can commit as we replay
operations
- keep applied_seq accurate
- pass seq (not old op_seq) to do_transactions
- carry open_ops ref so that commit blocks until we have finished applying
the full transaction
Sage Weil [Thu, 10 Nov 2011 18:49:32 +0000 (10:49 -0800)]
filestore: sync after non-idempotent operations
This is a big hammer to fix journal replay on non-btrfs fs backends (extN,
xfs, whatever). The problem is that it is not safe to replay some journal
operations more than once, notably things like CLONE whose source data
may be changed by subsequent operations.
The simple fix is to initiate a full commit after any non-idempotent
operations prior to any subsequent operation within the same Sequencer.
This is done by calling trigger_commit() in _do_transactions(), which means
any potentially dependent operation that follows will get blocked because
a commit is about to start.
I made trigger_commit() a bit more robust: for callers that are not holding
an open_ops ref, it now also succeeds if the given op_seq is already
committing. For the current caller, that can't happen.
There are probably better performing solutions, but this one is at least
correct.
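To illustrate why replaying CLONE twice is unsafe, here is a minimal sketch using a toy in-memory store. The Op, Store, apply() and replay() names are hypothetical stand-ins for illustration, not FileStore types.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not Ceph's types.
enum class OpType { Write, Clone };

struct Op {
  OpType type;
  std::string obj;  // Write: object to overwrite; Clone: source object
  std::string arg;  // Write: new contents;        Clone: destination object
};

using Store = std::map<std::string, std::string>;

// Apply a single operation to the store.
void apply(Store& store, const Op& op) {
  if (op.type == OpType::Write) {
    store[op.obj] = op.arg;
  } else {
    store[op.arg] = store[op.obj];  // copy the source's current contents
  }
}

// Apply every journaled operation in order.
void replay(Store& store, const std::vector<Op>& journal) {
  for (const Op& op : journal) apply(store, op);
}

// Journal: clone "a" to "b", then overwrite "a".
std::vector<Op> example_journal() {
  return {{OpType::Clone, "a", "b"}, {OpType::Write, "a", "new"}};
}
```

Starting from a == "old", the first pass leaves b == "old". Replaying the same journal on top of that state clones the already overwritten "a", so b becomes "new". Committing right after the CLONE means a later replay starts after it rather than before it.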
Fixes: #213
Signed-off-by: Sage Weil <sage@newdream.net>
Samuel Just [Thu, 10 Nov 2011 22:07:12 +0000 (14:07 -0800)]
OSD: sync_and_flush after mkfs to create first snap
Previously, if we kill the OSD process before the filestore
does its first sync, we end up replaying the journal on top
of current and potentially hitting -EEXIST.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 9 Nov 2011 22:34:30 +0000 (14:34 -0800)]
crypto: make crypto handlers non-static
These were static in auth/Crypto.cc, which was mostly fine, except when
we got a signal shutting everything down for the gcov stuff, like so:
Thread 21 (Thread 2164):
#0 0x00007f31a800b3cd in open64 () from /lib/libpthread.so.0
#1 0x000000000081dee0 in __gcov_open ()
#2 0x000000000081e3fd in gcov_exit ()
#3 0x00007f31a67e64f2 in exit () from /lib/libc.so.6
#4 0x000000000054e1ca in handle_signal (signal=<value optimized out>) at osd/OSD.cc:600
#5 <signal handler called>
#6 0x00007f31a8007a9a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#7 0x0000000000636d7b in Wait (this=0x2241000) at ./common/Cond.h:48
#8 SimpleMessenger::wait (this=0x2241000) at msg/SimpleMessenger.cc:2637
#9 0x00000000004a4e35 in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_osd.cc:343
and a racing thread would, say, accept a connection and then crash, like
so:
#0 0x00007f31a800ba0b in raise () from /lib/libpthread.so.0
#1 0x0000000000696eeb in reraise_fatal (signum=2164) at global/signal_handler.cc:59
#2 0x00000000006976cc in handle_fatal_signal (signum=<value optimized out>) at global/signal_handler.cc:106
#3 <signal handler called>
#4 0x00007f31a67e0ba5 in raise () from /lib/libc.so.6
#5 0x00007f31a67e46b0 in abort () from /lib/libc.so.6
#6 0x00007f31a70846bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7 0x00007f31a7082906 in ?? () from /usr/lib/libstdc++.so.6
#8 0x00007f31a7082933 in std::terminate() () from /usr/lib/libstdc++.so.6
#9 0x00007f31a708328f in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#10 0x0000000000690e5b in CryptoKey::decrypt (this=0x7f3195a67510, in=..., out=..., error=...) at auth/Crypto.cc:404
#11 0x000000000079ccee in void decode_decrypt_enc_bl<CephXServiceTicketInfo>(CephXServiceTicketInfo&, CryptoKey, ceph::buffer::list&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ()
#12 0x0000000000795ca3 in cephx_verify_authorizer (cct=0x2232000, keys=<value optimized out>, indata=...,
ticket_info=<value optimized out>, reply_bl=<value optimized out>) at auth/cephx/CephxProtocol.cc:438
#13 0x00000000007a17cf in CephxAuthorizeHandler::verify_authorizer (this=<value optimized out>, cct=0x2232000, keys=0x2256000,
authorizer_data=<value optimized out>, authorizer_reply=..., entity_name=..., global_id=@0x7f3195a67848, caps_info=...,
auid=0x7f3195a67840) at auth/cephx/CephxAuthorizeHandler.cc:21
#14 0x00000000005577ff in OSD::ms_verify_authorizer (this=0x2267000, con=0x230da00, peer_type=<value optimized out>,
protocol=<value optimized out>, authorizer_data=<value optimized out>, authorizer_reply=<value optimized out>,
isvalid=@0x7f3195a67c0f) at osd/OSD.cc:2723
#15 0x0000000000611ce1 in ms_deliver_verify_authorizer (this=<value optimized out>, con=0x230da00, peer_type=4, protocol=2,
authorizer=<value optimized out>, authorizer_reply=<value optimized out>, isvalid=@0x7f3195a67c0f) at msg/Messenger.h:145
#16 SimpleMessenger::verify_authorizer (this=<value optimized out>, con=0x230da00, peer_type=4, protocol=2,
authorizer=<value optimized out>, authorizer_reply=<value optimized out>, isvalid=@0x7f3195a67c0f)
at msg/SimpleMessenger.cc:2419
#17 0x00000000006309ab in SimpleMessenger::Pipe::accept (this=0x22ce280) at msg/SimpleMessenger.cc:756
#18 0x0000000000634711 in SimpleMessenger::Pipe::reader (this=0x22ce280) at msg/SimpleMessenger.cc:1546
#19 0x00000000004a7085 in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:208
#20 0x000000000060f252 in Thread::_entry_func (arg=0x874) at common/Thread.cc:42
#21 0x00007f31a8003971 in start_thread () from /lib/libpthread.so.0
#22 0x00007f31a689392d in clone () from /lib/libc.so.6
#23 0x0000000000000000 in ?? ()
Instead, put these on the heap. Set them up in the ceph::crypto::init()
method, and tear them down in ceph::crypto::shutdown().
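A minimal sketch of the pattern, assuming hypothetical names (Handler, crypto_init(), crypto_shutdown(), get_handler()) rather than the actual auth/Crypto.cc code. Objects with static storage duration are destroyed during exit(), so a thread that is still running can call a virtual method on a handler that is already gone. Heap objects live until they are explicitly deleted.

```cpp
#include <cassert>
#include <string>

// Hypothetical handler interface for illustration only.
struct Handler {
  virtual ~Handler() = default;
  virtual std::string name() const = 0;
};

struct NoneHandler : Handler {
  std::string name() const override { return "none"; }
};

struct AesHandler : Handler {
  std::string name() const override { return "aes"; }
};

// Only the pointers are static; the pointed-to objects live on the heap
// and are not destroyed automatically by exit().
static Handler* none_handler = nullptr;
static Handler* aes_handler = nullptr;

// Allocate the handlers once, at an explicit point during startup.
void crypto_init() {
  if (!none_handler) none_handler = new NoneHandler;
  if (!aes_handler) aes_handler = new AesHandler;
}

// Free the handlers at an explicit point during teardown.
void crypto_shutdown() {
  delete none_handler;
  none_handler = nullptr;
  delete aes_handler;
  aes_handler = nullptr;
}

// Type codes 0 and 1 are assumptions made for this sketch.
Handler* get_handler(int type) {
  switch (type) {
    case 0: return none_handler;
    case 1: return aes_handler;
    default: return nullptr;
  }
}
```

The point of the design is that teardown happens only when shutdown is called explicitly, not as a side effect of exit() running static destructors.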
Fixes: #1633
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Tue, 8 Nov 2011 18:54:57 +0000 (10:54 -0800)]
PG: cache read-only reference to the current osdmap on pg lock
Previously, we needed to grab an osd_map read lock to send messages,
among other things. Now, we grab a reference to the osd_map on pg lock
and refer to that.
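A rough sketch of the idea, with hypothetical Map, Service and Group types standing in for the OSDMap, the OSD and the PG. The locking layout is an assumption made for illustration. The map holder is consulted once, when the group lock is taken, and later reads use the cached reference.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for an immutable, reference-counted map.
struct Map {
  unsigned epoch;
};
using MapRef = std::shared_ptr<const Map>;

// Hypothetical holder of the newest map.
struct Service {
  std::mutex map_lock;
  MapRef current;

  MapRef get_map() {
    std::lock_guard<std::mutex> l(map_lock);
    return current;
  }

  void publish(unsigned epoch) {
    MapRef m = std::make_shared<Map>(Map{epoch});
    std::lock_guard<std::mutex> l(map_lock);
    current = m;
  }
};

// Hypothetical stand-in for a PG: the map reference is captured in lock().
struct Group {
  Service& svc;
  std::mutex group_lock;
  MapRef cached;

  explicit Group(Service& s) : svc(s) {}

  void lock() {
    group_lock.lock();
    cached = svc.get_map();  // one short critical section on the map lock
  }
  void unlock() { group_lock.unlock(); }

  // No map lock needed: the cached map cannot change underneath us.
  unsigned epoch_seen() const { return cached->epoch; }
};
```

A map published while the group is locked is not visible until the next lock(), which gives the holder a stable view for the duration of the lock.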
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Tue, 8 Nov 2011 17:45:44 +0000 (09:45 -0800)]
OSDMap,CrushWrapper: const cleanup on OSDMap
The osd's cached maps are not actually modified once cached. Marking
these methods const (which they should be) allows us to make OSDMapRef
shared_ptr<const OSDMap>.
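A small sketch of what the const cleanup buys, using a hypothetical Map type in place of OSDMap and std::shared_ptr in place of whichever shared_ptr the tree uses. Only const methods can be called through the const reference, so accessors must be marked const before the typedef can change.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for OSDMap.
struct Map {
  unsigned epoch = 0;
  unsigned get_epoch() const { return epoch; }  // callable through MapRef
  void set_epoch(unsigned e) { epoch = e; }     // not callable through MapRef
};

using MapRef = std::shared_ptr<const Map>;

// Build the map while it is still mutable, then hand out a read-only ref.
MapRef make_cached_map(unsigned epoch) {
  std::shared_ptr<Map> m = std::make_shared<Map>();
  m->set_epoch(epoch);
  return m;  // implicit conversion to shared_ptr<const Map>
}
```

Calling set_epoch() through a MapRef fails to compile, which turns "cached maps are never modified" from a convention into something the compiler checks.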
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Wed, 2 Nov 2011 21:32:17 +0000 (14:32 -0700)]
PG: always add backlog entry
Previously, we did not add a backlog entry if the object already had an
entry in the log along with an entry for that entry's prior_version.
However, when scanning the log, an OSD will incorrectly conclude that it
has the prior_version's prior_version if the object is not already in
the missing set. If there happens to be a clone entry with that version
as its prior_version, the osd will attempt to recover the clone via a
clone operation on the non-existent object.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
If the rbd showmapped command is given any extra arguments, rbd fails
with "assert(0)". Fix it by calling "usage_exit()" when any extra
arguments are present instead of asserting.
Noah Watkins [Wed, 9 Nov 2011 02:39:20 +0000 (18:39 -0800)]
hadoop: return all replica hostnames
Updates CephFileSystem to return all replica locations,
and in addition attempts to use reverse DNS to convert
the OSD IPs into hostnames. Hadoop does not do well at
comparing IPs with hostnames, so without this locality is lost.
Noah Watkins [Wed, 9 Nov 2011 02:39:19 +0000 (18:39 -0800)]
hadoop: handle new ceph_get_file_stripe_address
Updates the Hadoop JNI/CephFileSystem to handle
the new version of ceph_get_file_stripe_address
which returns the locations of replicas in addition
to the primary.
Noah Watkins [Wed, 9 Nov 2011 02:39:18 +0000 (18:39 -0800)]
client: return stripe address replicas
Changes ceph_get_file_stripe_address to return a
vector of entity_addr_t's for the primary and the
replicas. libcephfs is updated to return the
associated sockaddr_storage for each address.