Samuel Just [Sat, 5 Nov 2011 00:36:21 +0000 (17:36 -0700)]
OSD: write_info/log before dropping lock in generate_backlog
Bug #1530
This should fix the following race:
1) osd->generate_backlog does pg->assemble_backlog
2) osd->generate_backlog drops the pg lock to grab the osd_map read lock
3) ...which is held by osd->handle_osd_map
4) at the end of osd->handle_osd_map, we call write_info on the pg since
it has progressed to a new peering interval
5) osd->generate_backlog gets the read_lock and the pg lock and promptly
bails since the backlog generation has been cancelled
6) osd dies, but not before the write_info transaction is durable
The result of this is that the in-memory backlog generated in
assemble_backlog doesn't make it to disk, but the updated info does
resulting in an ondisklog inconsistent with the pg info on osd
restart.
This should prevent the info from being written without the log.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Using sync_file_range means that neither any required metadata gets commited,
nor the disk cache gets flushed. Stop using it for the journal, and add
a comment on why a fsync_range system call would be helpful here.
Btw, why does the code use O_SYNC (and not even O_DSYNC!) if using direct
I/O, but fdatasync/fsync for buffered I/O? Avoiding cache flushes and
metadata updates for every writes is just as important for direct I/O
as it is for buffered I/O.
Greg Farnum [Thu, 3 Nov 2011 20:49:56 +0000 (13:49 -0700)]
test: write a test to try and check on Client::readdir_r_cb.
It's made difficult by having to go through libcephfs, but it's better
than nothing and should catch most of the errors which were detected
while using it in Hadoop.
Noah Watkins [Wed, 2 Nov 2011 19:25:15 +0000 (12:25 -0700)]
hadoop: remove deprecated isDirectory()
Uses the suggested getFileStatus() method for
replacing the deprecated isDirectory(). This is
only marginally slower as get_replication is called
to fill in the FileStatus. If performance ever became
an issue for the paths that use isDirectory() then
getFileStatus can be made faster by pushing more down
into JNI.
Noah Watkins [Wed, 2 Nov 2011 18:58:43 +0000 (11:58 -0700)]
hadoop: remove initialization check
The initialization check is removed because
it is part of Hadoop's treatment of file systems
that initialize() is called prior to any other
file system routines. This makes the code cleaner
but in the future verison of libcephfs-java, internal
initialization checks should still be made.
Noah Watkins [Wed, 2 Nov 2011 04:52:48 +0000 (21:52 -0700)]
hadoop: simplify workingDir handling; add home directory
1. Simplifies the handling of paths by allowing them to be passed
around and manipulated in their fully qualified form. Before
paths are passed into native Ceph calls the path-only portion
is extracted.
2. Sets the initial working directory to be the default home
directory for a user (e.g. /user/<username>/).
Noah Watkins [Wed, 2 Nov 2011 00:25:49 +0000 (17:25 -0700)]
hadoop: emulate Ceph file owner as current user
Make CephFileSystem tell Hadoop that the owner
of all files is the current user. This provides
zero security or isolation, but allows Hadoop
to be used with its default security settings.
A future solution will need to be developed that
provides some isolation, and gives a better user
experience.
Noah Watkins [Tue, 1 Nov 2011 23:35:12 +0000 (16:35 -0700)]
hadoop: use standard log4j logging facility
Replace ceph.debug(msg, level) with LOG.level(msg)
provided by the log4j facility used by Hadoop. The
level can now be provided on a class-by-class basis
by modifying conf/log4j.properties.
Josh Durgin [Tue, 1 Nov 2011 17:40:41 +0000 (10:40 -0700)]
monclient: fail fast when our auth protocols aren't supported
This handles the case where the server does not support any of the
authentication protocols that the client does. Previously this error
would never be propagated, and you'd only know something went wrong
when the optional timeout expired. Now, monclient->authenticate()
fails as soon as it gets the first response from the monitor.
Samuel Just [Tue, 1 Nov 2011 18:16:53 +0000 (11:16 -0700)]
PG: set_last_peering_reset in Reset constructor
If an osd in the prior set comes up, we can restart peering without a
new peering interval starting. However, we still want to ignore
anything we previously requested from replicas.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Josh Durgin [Tue, 1 Nov 2011 17:40:41 +0000 (10:40 -0700)]
monclient: fail fast when our auth protocols aren't supported
This handles the case where the server does not support any of the
authentication protocols that the client does. Previously this error
would never be propagated, and you'd only know something went wrong
when the optional timeout expired. Now, monclient->authenticate()
fails as soon as it gets the first response from the monitor.
Samuel Just [Fri, 28 Oct 2011 21:18:12 +0000 (14:18 -0700)]
PG: Create new snap directories independently on replica
Previously, we shipped over the collection creation as part
of the transaction. However, the snap directory on the
replica might or might not exist already due to recovery
progress.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Josh Durgin [Fri, 28 Oct 2011 01:11:28 +0000 (18:11 -0700)]
auth: return unknown if no supported auth is found
If NONE is supported, it will already be in the list of supported
protocols, so there's no need to default to it here. This prevents
clients that request the NONE protocol from authenticating when the
server only accepts CEPHX. Instead, they get -ENOTSUP from the
AuthMonitor.
Greg Farnum [Fri, 28 Oct 2011 00:24:49 +0000 (17:24 -0700)]
uclient: fix _getdents and add some documentation.
If readdir_r_cb returns 0, that means SUCCESS, regardless of how
many entries it actually wrote.
If it returns <0, then either:
a) our callback failed to fit some data (returning -1), which is fine
unless we took no entries (so return gr.pos or -ERANGE, depending), or
b) something else failed, so return that error code.
If it returns 0, we succeeded, so return the amount of used buffer.
Greg Farnum [Thu, 27 Oct 2011 23:14:58 +0000 (16:14 -0700)]
uclient: align readdirplus_r with readdir_r.
The only user of this code expects to get 1 on a successfully-filled
value, 0 on a successful non-fill, or -errno otherwise. But the
callback returns -1 to indicate it's already been filled in, which
will happen on every single call to a directory with multiple entries...
Sage Weil [Thu, 27 Oct 2011 16:47:20 +0000 (09:47 -0700)]
filejournal: journal_replay_from
Force journal replay from a point other than the op_seq recorded by the
fs. This is useful if you want to skip bad entries in the journal (e.g.,
because they were non-idempotent and you know they were applied and the fs
operations were fully ordered).