Samuel Just [Thu, 10 Nov 2011 22:07:12 +0000 (14:07 -0800)]
OSD: sync_and_flush after mkfs to create first snap
Previously, if we kill the OSD process before the filestore
does its first sync, we end up replaying the journal on top
of current and potentially hitting -EEXIST.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 9 Nov 2011 22:34:30 +0000 (14:34 -0800)]
crypto: make crypto handlers non-static
These were static in auth/Crypto.cc, which was mostly fine, except when
we got a signal shutting everything down for the gcov stuff, like so:
Thread 21 (Thread 2164):
#0 0x00007f31a800b3cd in open64 () from /lib/libpthread.so.0
#1 0x000000000081dee0 in __gcov_open ()
#2 0x000000000081e3fd in gcov_exit ()
#3 0x00007f31a67e64f2 in exit () from /lib/libc.so.6
#4 0x000000000054e1ca in handle_signal (signal=<value optimized out>) at osd/OSD.cc:600
#5 <signal handler called>
#6 0x00007f31a8007a9a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#7 0x0000000000636d7b in Wait (this=0x2241000) at ./common/Cond.h:48
#8 SimpleMessenger::wait (this=0x2241000) at msg/SimpleMessenger.cc:2637
#9 0x00000000004a4e35 in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_osd.cc:343
and a racing thread would, say, accept a connection and then crash, like
so:
#0 0x00007f31a800ba0b in raise () from /lib/libpthread.so.0
#1 0x0000000000696eeb in reraise_fatal (signum=2164) at global/signal_handler.cc:59
#2 0x00000000006976cc in handle_fatal_signal (signum=<value optimized out>) at global/signal_handler.cc:106
#3 <signal handler called>
#4 0x00007f31a67e0ba5 in raise () from /lib/libc.so.6
#5 0x00007f31a67e46b0 in abort () from /lib/libc.so.6
#6 0x00007f31a70846bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7 0x00007f31a7082906 in ?? () from /usr/lib/libstdc++.so.6
#8 0x00007f31a7082933 in std::terminate() () from /usr/lib/libstdc++.so.6
#9 0x00007f31a708328f in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#10 0x0000000000690e5b in CryptoKey::decrypt (this=0x7f3195a67510, in=..., out=..., error=...) at auth/Crypto.cc:404
#11 0x000000000079ccee in void decode_decrypt_enc_bl<CephXServiceTicketInfo>(CephXServiceTicketInfo&, CryptoKey, ceph::buffer::list&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ()
#12 0x0000000000795ca3 in cephx_verify_authorizer (cct=0x2232000, keys=<value optimized out>, indata=...,
ticket_info=<value optimized out>, reply_bl=<value optimized out>) at auth/cephx/CephxProtocol.cc:438
#13 0x00000000007a17cf in CephxAuthorizeHandler::verify_authorizer (this=<value optimized out>, cct=0x2232000, keys=0x2256000,
authorizer_data=<value optimized out>, authorizer_reply=..., entity_name=..., global_id=@0x7f3195a67848, caps_info=...,
auid=0x7f3195a67840) at auth/cephx/CephxAuthorizeHandler.cc:21
#14 0x00000000005577ff in OSD::ms_verify_authorizer (this=0x2267000, con=0x230da00, peer_type=<value optimized out>,
protocol=<value optimized out>, authorizer_data=<value optimized out>, authorizer_reply=<value optimized out>,
isvalid=@0x7f3195a67c0f) at osd/OSD.cc:2723
#15 0x0000000000611ce1 in ms_deliver_verify_authorizer (this=<value optimized out>, con=0x230da00, peer_type=4, protocol=2,
authorizer=<value optimized out>, authorizer_reply=<value optimized out>, isvalid=@0x7f3195a67c0f) at msg/Messenger.h:145
#16 SimpleMessenger::verify_authorizer (this=<value optimized out>, con=0x230da00, peer_type=4, protocol=2,
authorizer=<value optimized out>, authorizer_reply=<value optimized out>, isvalid=@0x7f3195a67c0f)
at msg/SimpleMessenger.cc:2419
#17 0x00000000006309ab in SimpleMessenger::Pipe::accept (this=0x22ce280) at msg/SimpleMessenger.cc:756
#18 0x0000000000634711 in SimpleMessenger::Pipe::reader (this=0x22ce280) at msg/SimpleMessenger.cc:1546
#19 0x00000000004a7085 in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:208
#20 0x000000000060f252 in Thread::_entry_func (arg=0x874) at common/Thread.cc:42
#21 0x00007f31a8003971 in start_thread () from /lib/libpthread.so.0
#22 0x00007f31a689392d in clone () from /lib/libc.so.6
#23 0x0000000000000000 in ?? ()
Instead, put these on the heap. Set them up in the ceph::crypto::init()
method, and tear them down in ceph::crypto::shutdown().
Fixes: #1633
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
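A minimal sketch of the heap-allocated pattern described in the crypto commit, assuming the crash came from exit() destroying the static handler objects while another thread was still calling into them. All names here (HandlerSketch, crypto_init_sketch, and so on) are hypothetical and are not the real classes in auth/Crypto.cc.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a crypto handler interface.
struct HandlerSketch {
  virtual ~HandlerSketch() {}
  virtual std::string name() const = 0;
};

struct NoneHandlerSketch : HandlerSketch {
  std::string name() const override { return "none"; }
};

struct AesHandlerSketch : HandlerSketch {
  std::string name() const override { return "aes"; }
};

// Raw pointers have no destructor, so exit() does not tear the
// handlers down on its own; only the explicit shutdown call does.
static HandlerSketch *g_none = nullptr;
static HandlerSketch *g_aes = nullptr;

// Allocate the handlers once, early in startup.
void crypto_init_sketch() {
  assert(!g_none && !g_aes);
  g_none = new NoneHandlerSketch;
  g_aes = new AesHandlerSketch;
}

// Free the handlers once no thread can still be using them.
void crypto_shutdown_sketch() {
  delete g_none;
  g_none = nullptr;
  delete g_aes;
  g_aes = nullptr;
}

// Look up a handler by a hypothetical numeric type (0 = none, 1 = aes).
HandlerSketch *get_handler_sketch(int type) {
  switch (type) {
    case 0: return g_none;
    case 1: return g_aes;
    default: return nullptr;
  }
}
```

The point of the design is that teardown happens only when the program asks for it, not implicitly when a signal handler calls exit().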
Samuel Just [Tue, 8 Nov 2011 18:54:57 +0000 (10:54 -0800)]
PG: cache read-only reference to the current osdmap on pg lock
Previously, we needed to grab an osd_map read lock to send messages,
among other things. Now, we grab a reference to the osd_map on pg lock
and refer to that.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
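A rough sketch of caching a map reference at PG lock time, under the assumption that the map is published as a shared pointer and that copying the pointer needs only a brief lock. OsdSketch, PgSketch, and MapSnapshotSketch are hypothetical names, not the real OSD and PG classes.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical, reduced stand-in for OSDMap.
struct MapSnapshotSketch {
  unsigned epoch;
};
typedef std::shared_ptr<const MapSnapshotSketch> MapSnapshotRef;

// Hypothetical stand-in for the OSD: owns the newest map.
class OsdSketch {
  std::mutex map_lock_;
  MapSnapshotRef current_;
public:
  explicit OsdSketch(unsigned epoch)
    : current_(std::make_shared<MapSnapshotSketch>(MapSnapshotSketch{epoch})) {}

  // Copying the reference holds the map lock only for an instant.
  MapSnapshotRef get_map() {
    std::lock_guard<std::mutex> l(map_lock_);
    return current_;
  }

  // Publish a newer map; holders of older references are unaffected.
  void publish(unsigned epoch) {
    MapSnapshotRef next =
        std::make_shared<MapSnapshotSketch>(MapSnapshotSketch{epoch});
    std::lock_guard<std::mutex> l(map_lock_);
    current_ = next;
  }
};

class PgSketch {
  std::mutex lock_;
  MapSnapshotRef map_;  // meaningful only while the PG lock is held
public:
  void lock(OsdSketch &osd) {
    lock_.lock();
    map_ = osd.get_map();  // cache the current map for this critical section
  }
  void unlock() {
    map_.reset();
    lock_.unlock();
  }
  // Code that used to take the map read lock reads the cached reference.
  unsigned epoch_for_send() const {
    assert(map_);
    return map_->epoch;
  }
};
```

While the PG lock is held, the PG keeps seeing the map it cached even if a newer one is published; the next lock() picks up the newer map.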
Samuel Just [Tue, 8 Nov 2011 17:45:44 +0000 (09:45 -0800)]
OSDMap,CrushWrapper: const cleanup on OSDMap
The osd's cached maps are not actually modified once cached. Marking
these methods const (which they should be) allows us to make OSDMapRef
shared_ptr<const OSDMap>.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
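To illustrate why the const cleanup matters, here is a small sketch with a hypothetical MapSketch class standing in for OSDMap. Only const member functions can be called through a shared_ptr<const T>, so accessors that are not marked const would make such a reference type unusable.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, heavily reduced stand-in for OSDMap.
class MapSketch {
  unsigned epoch_;
public:
  explicit MapSketch(unsigned e) : epoch_(e) {}

  // const accessor: callable through a reference to const.
  unsigned get_epoch() const { return epoch_; }

  // Mutator: not callable through MapRefSketch, which is the point.
  void inc_epoch() { ++epoch_; }
};

// Plays the role that OSDMapRef plays in the commit message.
typedef std::shared_ptr<const MapSketch> MapRefSketch;

// Works only because get_epoch() is declared const.
unsigned epoch_of(const MapRefSketch &m) {
  assert(m);
  return m->get_epoch();
}
```

A MapRefSketch can be created with std::make_shared<MapSketch>(5); the conversion to a pointer-to-const is implicit, and a call such as m->inc_epoch() through it fails to compile.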
Samuel Just [Wed, 2 Nov 2011 21:32:17 +0000 (14:32 -0700)]
PG: always add backlog entry
Previously, we did not add a backlog entry if the object already had an
entry in the log along with an entry for that entry's prior_version.
However, when scanning the log, an OSD will incorrectly conclude that it
has the prior_version's prior_version if the object is not already in
the missing set. If there happens to be a clone entry with that version
as its prior_version, the osd will attempt to recover the clone via a
clone operation on the non-existent object.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
If the rbd showmapped cmd is given any extra arguments, rbd will fail
with "assert(0)". Fix it by calling "usage_exit()" when any extra
arguments are present, instead of asserting.
Noah Watkins [Wed, 9 Nov 2011 02:39:20 +0000 (18:39 -0800)]
hadoop: return all replica hostnames
Updates CephFileSystem to return all replica locations,
and in addition attempts to use reverse DNS to convert
the OSD IPs into hostnames. Hadoop does not do well at
comparing IPs with hostnames, so without this locality is lost.
Noah Watkins [Wed, 9 Nov 2011 02:39:19 +0000 (18:39 -0800)]
hadoop: handle new ceph_get_file_stripe_address
Updates the Hadoop JNI/CephFileSystem to handle
the new version of ceph_get_file_stripe_address
which returns the locations of replicas in addition
to the primary.
Noah Watkins [Wed, 9 Nov 2011 02:39:18 +0000 (18:39 -0800)]
client: return stripe address replicas
Changes ceph_get_file_stripe_address to return a
vector of entity_addr_t's for the primary and the
replicas. libcephfs is updated to return the
associated sockaddr_storage for each address.
Greg Farnum [Wed, 9 Nov 2011 19:43:21 +0000 (11:43 -0800)]
os: rename and make use of the split_threshold parameter.
This was accidentally left out of the must_split calculation. Put it
in, and rename it to split_multiplier (as that is a much better name
for how it's used).
In the default case this won't actually change behavior, but it
makes the behavior configurable as it's supposed to be.
Sage Weil [Wed, 9 Nov 2011 05:13:07 +0000 (21:13 -0800)]
automake: enable 'make V=0'
Enables silent mode for automake-generated Makefiles;
silent mode is _off_ by default. With V=0, the output
is much easier to read when trying to find warnings:
Sage Weil [Tue, 8 Nov 2011 21:09:13 +0000 (13:09 -0800)]
paxos: fix race between active and commit
If paxos reproposes an old learned value, we have a C_Active waiter, and
also a commit in progress.
When we reach quorum, paxos goes active, and _active() creates a new
pending. A bit later, the _commit callback goes, and we already have that
pending value ready.
Sage Weil [Tue, 8 Nov 2011 06:46:09 +0000 (22:46 -0800)]
mon: slurp latest state from active monitors before joining quorum
If a monitor has been down and is behind, and joins the quorum, the
other nodes will try to send it all of the needed state, which can
bring the cluster to a halt.
Instead, implement a new bootstrap() procedure:
- probe the cluster nodes
- if there is an existing quorum,
  - and it is not too far ahead of me, join it (call an election)
  - otherwise, slurp down all the newer state and then restart (bootstrap)
- if we see enough online nodes that are not part of the quorum, call
  an election.
We still need to add some timeouts.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
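The bootstrap() decision described in the mon commit can be sketched roughly as below. This is only an illustration: the struct, the enum, the max_lag parameter, and the reading of "enough online nodes" as a strict majority are all assumptions, not the monitor's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcomes of one probe round.
enum class BootstrapAction { JoinQuorum, Slurp, CallElection, KeepProbing };

// Hypothetical summary of what probing the cluster nodes returned.
struct ProbeSummarySketch {
  bool quorum_exists;
  uint64_t quorum_version;  // newest version reported by the quorum
  uint64_t my_version;      // newest version this monitor has
  unsigned outside_quorum;  // online monitors (including me) not in a quorum
  unsigned total_monitors;  // size of the monitor map
};

// max_lag is an assumed tunable for "not too far ahead of me".
BootstrapAction decide_bootstrap(const ProbeSummarySketch &p, uint64_t max_lag) {
  assert(p.total_monitors > 0);
  if (p.quorum_exists) {
    uint64_t lag = p.quorum_version > p.my_version
                       ? p.quorum_version - p.my_version
                       : 0;
    // Close enough: join by calling an election. Otherwise catch up first
    // and then restart the bootstrap.
    return lag <= max_lag ? BootstrapAction::JoinQuorum
                          : BootstrapAction::Slurp;
  }
  // Assumption: "enough" means a strict majority of all monitors.
  if (p.outside_quorum > p.total_monitors / 2)
    return BootstrapAction::CallElection;
  return BootstrapAction::KeepProbing;
}
```

Keeping the decision in a pure function like this makes it easy to exercise each branch without a running cluster; the timeouts the commit mentions are not modelled.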
Sage Weil [Wed, 2 Nov 2011 02:45:17 +0000 (19:45 -0700)]
mon: rename election_starting -> restart
These callbacks reset monitor/paxos/paxosservice state, which used to
happen when an election started, but will now not necessarily be
immediately followed by an election.
Samuel Just [Sat, 5 Nov 2011 00:36:21 +0000 (17:36 -0700)]
OSD: write_info/log before dropping lock in generate_backlog
Bug #1530
This should fix the following race:
1) osd->generate_backlog does pg->assemble_backlog
2) osd->generate_backlog drops the pg lock to grab the osd_map read lock
3) ...which is held by osd->handle_osd_map
4) at the end of osd->handle_osd_map, we call write_info on the pg since
it has progressed to a new peering interval
5) osd->generate_backlog gets the read_lock and the pg lock and promptly
bails since the backlog generation has been cancelled
6) osd dies, but not before the write_info transaction is durable
The result of this is that the in-memory backlog generated in
assemble_backlog doesn't make it to disk, but the updated info does,
resulting in an ondisklog inconsistent with the pg info on osd
restart.
This should prevent the info from being written without the log.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
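A toy model of the generate_backlog fix, assuming the essential invariant is that the on-disk info never claims a backlog the on-disk log does not contain. The classes and the in-memory "disk" are hypothetical; real transactions and the osd_map lock are not modelled.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical reduced pg info: an interval epoch plus a backlog flag.
struct BacklogInfoSketch {
  unsigned epoch = 0;
  bool log_backlog = false;
};

// Hypothetical "disk": what would survive an OSD crash.
struct BacklogDiskSketch {
  BacklogInfoSketch info;
  bool log_backlog = false;
  bool consistent() const { return info.log_backlog == log_backlog; }
};

class BacklogPgSketch {
  std::mutex lock_;
  BacklogInfoSketch info_;          // in-memory info
  bool log_backlog_ = false;        // in-memory log state
  BacklogDiskSketch &disk_;
public:
  explicit BacklogPgSketch(BacklogDiskSketch &d) : disk_(d) {}

  void generate_backlog() {
    std::lock_guard<std::mutex> l(lock_);
    // Stands in for assemble_backlog: updates info and log in memory.
    log_backlog_ = true;
    info_.log_backlog = true;
    // The fix: persist info and log together before the lock is dropped.
    disk_.info = info_;
    disk_.log_backlog = log_backlog_;
    assert(disk_.consistent());
  }

  // Stands in for the write_info at the end of handle_osd_map,
  // which writes only the info.
  void advance_interval(unsigned epoch) {
    std::lock_guard<std::mutex> l(lock_);
    info_.epoch = epoch;
    disk_.info = info_;
  }
};
```

If generate_backlog released the lock before writing, advance_interval could put the updated info on disk while the log on disk still lacked the backlog, which is the inconsistency described above.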
Using sync_file_range means that neither the required metadata gets committed
nor the disk cache gets flushed. Stop using it for the journal, and add
a comment on why a fsync_range system call would be helpful here.
Btw, why does the code use O_SYNC (and not even O_DSYNC!) if using direct
I/O, but fdatasync/fsync for buffered I/O? Avoiding cache flushes and
metadata updates for every writes is just as important for direct I/O
as it is for buffered I/O.
Greg Farnum [Thu, 3 Nov 2011 20:49:56 +0000 (13:49 -0700)]
test: write a test to try and check on Client::readdir_r_cb.
It's made difficult by having to go through libcephfs, but it's better
than nothing and should catch most of the errors which were detected
while using it in Hadoop.
Noah Watkins [Wed, 2 Nov 2011 19:25:15 +0000 (12:25 -0700)]
hadoop: remove deprecated isDirectory()
Uses the suggested getFileStatus() method for
replacing the deprecated isDirectory(). This is
only marginally slower as get_replication is called
to fill in the FileStatus. If performance ever became
an issue for the paths that use isDirectory() then
getFileStatus can be made faster by pushing more down
into JNI.
Noah Watkins [Wed, 2 Nov 2011 18:58:43 +0000 (11:58 -0700)]
hadoop: remove initialization check
The initialization check is removed because
it is part of Hadoop's treatment of file systems
that initialize() is called prior to any other
file system routines. This makes the code cleaner,
but in a future version of libcephfs-java, internal
initialization checks should still be made.
Noah Watkins [Wed, 2 Nov 2011 04:52:48 +0000 (21:52 -0700)]
hadoop: simplify workingDir handling; add home directory
1. Simplifies the handling of paths by allowing them to be passed
around and manipulated in their fully qualified form. Before
paths are passed into native Ceph calls the path-only portion
is extracted.
2. Sets the initial working directory to be the default home
directory for a user (e.g. /user/<username>/).
Noah Watkins [Wed, 2 Nov 2011 00:25:49 +0000 (17:25 -0700)]
hadoop: emulate Ceph file owner as current user
Make CephFileSystem tell Hadoop that the owner
of all files is the current user. This provides
zero security or isolation, but allows Hadoop
to be used with its default security settings.
A future solution will need to be developed that
provides some isolation, and gives a better user
experience.
Noah Watkins [Tue, 1 Nov 2011 23:35:12 +0000 (16:35 -0700)]
hadoop: use standard log4j logging facility
Replace ceph.debug(msg, level) with LOG.level(msg)
provided by the log4j facility used by Hadoop. The
level can now be provided on a class-by-class basis
by modifying conf/log4j.properties.