David Zafman [Fri, 11 Apr 2014 00:16:33 +0000 (17:16 -0700)]
librados: Add ObjectWriteOperation::snap_rollback() for pool snapshots
snap_rollback() is the same as selfmanaged_snap_rollback() but we want an
independent interface for pool snapshots. Should really take snapname
for consistency with other pool snapshot interfaces.
Signed-off-by: David Zafman <david.zafman@inktank.com>
It's important to assign these for all operations for cases where
g_lockdep isn't yet true when the constructor runs. This is true
for the HeartbeatMap rwlock, among other things, as that thread
is created during early startup before lockdep is enabled. All
of the lockdep hooks assume that they can assign ids on the fly
and not tracking them here breaks things.
test:
Add set_completion*PP() functions to cast arg to correct class
Add return_value checks
Add some reads with buffers larger than object size
Check buffer length on reads
librados:
Make sure *return_value() has bytes read in all cases
Signed-off-by: David Zafman <david.zafman@inktank.com>
David Zafman [Wed, 2 Apr 2014 18:54:51 +0000 (11:54 -0700)]
librados, test: Have write, append and write_full return 0 on success
Make write, append, and write_full consistent: all return 0 on success.
This includes the C (rados_*) variants, the C++ ctx variants,
and the aio get_return_value() and rados_aio_get_return_value() calls.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Thu, 10 Apr 2014 20:34:58 +0000 (13:34 -0700)]
mon/OSDMonitor: ignore boot message from before last up_from
It is possible we will have a dup OSDBoot message queued up in the mon
and will process it again after that osd was marked up and then down. If
that happens, we should ignore this message, not mark the osd back in with
the same address.
Fixes: #8062
Signed-off-by: Sage Weil <sage@inktank.com>
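The check described above can be sketched roughly as follows. This is a minimal illustration, not the monitor's code: `OsdInfo`, `BootMessage`, `sent_epoch`, and `is_stale_boot()` are hypothetical names, and the assumption is that a boot message carries the map epoch the OSD had when it sent the message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the monitor's per-OSD record.
struct OsdInfo {
  uint32_t up_from = 0;  // epoch at which the OSD was last marked up
};

// Hypothetical stand-in for a queued boot message.
struct BootMessage {
  uint32_t sent_epoch = 0;  // assumed: map epoch the OSD had when it sent this
};

// A boot message sent before the OSD's most recent up_from has already
// been acted on (the OSD was marked up, then down), so it is a stale dup.
bool is_stale_boot(const OsdInfo& info, const BootMessage& m) {
  return m.sent_epoch < info.up_from;
}
```

A caller would drop the message when `is_stale_boot()` returns true instead of marking the OSD up again with the same address.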
Previously we assumed that if we had snapset_obc set, !exists on the head,
and we got the snapdir lock, we were good to take the head lock too.
This is not the case when:
- delete queued
- takes wr lock on both head and snapdir
- delete commits (but not yet applied)
- stat
- tries to take wr lock on head
- blocks, toggles w=1 state on *head only*
- copy-from
- tries to take wr lock on snapdir, succeeds
- tries to take wr lock on head, fails because w=1
- fails the assert(got)
The problem is that the read and write paths are taking different locks
and we are expecting them to operate in synchrony.
Fix this by using the same ordering for reads as well as write: if the
snapset_obc is defined, take the read lock on that too, just as we do with
a write.
Fixes: #8046
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
mon: Monitor: suicide on start if mon has been removed from monmap
If the monitor has been marked as having been part of an existing quorum
and is no longer in the monmap, then it is safe to assume the monitor
was removed from the monmap. In that event, do not allow the monitor
to start, as it will try to find its way into the quorum again (and
someone clearly stated they don't really want them there), unless
'mon force quorum join' is specified.
Fixes: 6789
Backport: dumpling, emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mds: guarantee message ordering when importing non-auth caps
The current code allows importing non-auth caps while the inode is being
exported. This can break message ordering because the corresponding cap
import messages are sent after the flush session messages. So they can
arrive at clients after the clients have already received cap import
messages from the new auth MDS of the inode.
The quick fix is to ignore MExportCaps when the inode is frozen.
mds: remove wrong assertion for remote frozen authpin
For a cross-authority rename, the MDS first freezes the source inode's
authpin. This happens while the source dentry isn't locked. So by the time
the inode's authpin becomes frozen, the source dentry may have changed and
be linked to a different inode.
Sage Weil [Wed, 9 Apr 2014 23:03:05 +0000 (16:03 -0700)]
mon: tell peers missing features during probe
Use a new probe op to inform mons that they are missing features during
the earliest probe phase. This prevents them from getting as far as
the sync entirely if they are too old.
We still need to refuse to speak to them if they try to call an election,
which they could do based on their replies from other peers.
Note that old clients will assert on getting a message type string they
don't understand, so we need to be careful not to send the probe reply
to older clients. The feature bit we use is not precise in that it does
not cover recent dev releases, but it does work for dumpling and emperor.
auth: separate writes of build_request() into prepare_build_request()
validate_tickets() updates internal state, as does
tickets.get_handler(). Move them into a new method called before
build_request() so build_request() can be declared const.
This allows methods using RWLock for reading to be declared const.
There might be cases where we'd want to take a write lock in a const
method, but right now that's unnecessary, and I'd rather get a compile
error.
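A minimal sketch of the split described above, using a hypothetical `TicketClient` class rather than the real auth handler: all state updates move into a non-const preparation step, so the build step can be declared const and the compiler rejects accidental writes inside it.

```cpp
#include <cassert>
#include <string>

// Hypothetical class; only the method names echo the commit message.
class TicketClient {
 public:
  // Mutating step: refresh internal state before a request is built.
  void prepare_build_request() {
    if (ticket_.empty()) {
      ticket_ = "fresh-ticket";  // placeholder for real ticket validation
    }
  }

  // Read-only step: const, because every write happened in the step above.
  std::string build_request() const {
    return "request:" + ticket_;
  }

 private:
  std::string ticket_;
};
```

Calling `prepare_build_request()` and then `build_request()` yields the same result as a single mutating method would, but any later attempt to modify members inside `build_request()` fails at compile time.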
auth: add rwlock to AuthClientHandler to prevent races
For cephx, build_authorizer reads a bunch of state (especially the
current session_key) which can be updated by the MonClient. With no
locks held, Pipe::connect() calls SimpleMessenger::get_authorizer()
which ends up calling RadosClient::get_authorizer() and then
AuthClientHandler::build_authorizer(). This unsafe usage can lead to
crashes like:
Program terminated with signal 11, Segmentation fault.
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
370 common/buffer.cc: No such file or directory.
in common/buffer.cc
(gdb) bt
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
0x00007fa0d2ddec00 in ~ptr (this=0x7f989c03b830) at ./include/buffer.h:171
ceph::buffer::list::rebuild (this=0x7f989c03b830) at common/buffer.cc:817
0x00007fa0d2ddecb9 in ceph::buffer::list::c_str (this=0x7f989c03b830) at common/buffer.cc:1045
0x00007fa0d2ea4dc2 in Pipe::connect (this=0x7fa0c4307340) at msg/Pipe.cc:907
0x00007fa0d2ea7d73 in Pipe::writer (this=0x7fa0c4307340) at msg/Pipe.cc:1518
0x00007fa0d2eb44dd in Pipe::Writer::entry (this=<value optimized out>) at msg/Pipe.h:59
0x00007fa0e0f5f9d1 in start_thread (arg=0x7f987a5e4700) at pthread_create.c:301
0x00007fa0de560b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Fix this by adding an rwlock to AuthClientHandler. A simpler fix would
be to move RadosClient::get_authorizer() into the MonClient() under
the MonClient lock, but this would not catch all uses of other
Authorizer, e.g. for verify_authorizer() and it would serialize
independent connection attempts.
This mainly matters for cephx, but none and unknown can have the
global_id reset as well.
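The reader/writer locking idea can be illustrated with standard C++ primitives. This is a sketch under assumptions: `SessionState` is a hypothetical class, and `std::shared_timed_mutex` (C++14) stands in for Ceph's own RWLock type. Readers take a shared lock and return a copy, so independent connection attempts do not serialize against each other, while key updates take the lock exclusively.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical holder of state that one thread updates and others read.
class SessionState {
 public:
  // Writer path: exclusive lock while the key is replaced.
  void set_session_key(const std::string& key) {
    std::unique_lock<std::shared_timed_mutex> l(lock_);
    session_key_ = key;
  }

  // Reader path: shared lock, so concurrent readers do not block each
  // other. Returning a copy keeps the caller safe after the lock drops.
  std::string build_authorizer() const {
    std::shared_lock<std::shared_timed_mutex> l(lock_);
    return "auth:" + session_key_;
  }

 private:
  mutable std::shared_timed_mutex lock_;  // mutable so const readers can lock
  std::string session_key_;
};
```

Marking the lock `mutable` is what lets the read path stay const while still taking the shared lock.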
Sage Weil [Wed, 9 Apr 2014 18:13:31 +0000 (11:13 -0700)]
mon: refresh elector required_features when they change
Currently we only refresh required_features on Elector::start(). This
does not prevent an old peer from calling an election (even though they
won't succeed in joining the resulting quorum).
Fix this by updating the elector's features when they change. This way we
don't allow a useless election cycle just to trigger that update in
start().
Zero-length writes would hang because the completion was never
called. Reads would hit an assert about zero length in
Striper::file_to_extents().
Fix all of these cases by skipping zero-length extents. The completion
is created and finished when finish_adding_requests() is called. This
is slightly different from usual completions since it comes from the
same thread as the one scheduling the request, but zero-length aio
requests should never happen from things that might care about this,
like QEMU.
Writes and discards have had this bug since the beginning of
librbd. Reads might have avoided it until stripingv2 was added.
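The behavior described above can be sketched as follows. The `AioCompletion` here is a hypothetical, single-threaded simplification (not librbd's class): zero-length extents are skipped, and the completion fires from `finish_adding_requests()` when nothing was queued.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Hypothetical completion: fires its callback once all queued requests
// are done and no more requests will be added.
class AioCompletion {
 public:
  explicit AioCompletion(std::function<void()> cb) : cb_(std::move(cb)) {}

  void add_request() { ++pending_; }

  void complete_request() {
    assert(pending_ > 0);
    --pending_;
    maybe_finish();
  }

  void finish_adding_requests() {
    building_ = false;
    maybe_finish();  // fires immediately if nothing was queued
  }

  bool done() const { return done_; }

 private:
  void maybe_finish() {
    if (!building_ && pending_ == 0 && !done_) {
      done_ = true;
      if (cb_) cb_();
    }
  }

  std::function<void()> cb_;
  int pending_ = 0;
  bool building_ = true;
  bool done_ = false;
};

// Queue one request per non-empty extent; returns how many were queued.
int submit(const std::vector<Extent>& extents, AioCompletion& c) {
  int queued = 0;
  for (const Extent& e : extents) {
    if (e.length == 0) continue;  // skip zero-length extents entirely
    c.add_request();
    ++queued;
  }
  c.finish_adding_requests();
  return queued;
}
```

With only zero-length extents, the callback runs inside `submit()` on the caller's thread, which matches the caveat in the commit message about where the completion comes from.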
Sage Weil [Wed, 9 Apr 2014 00:28:54 +0000 (17:28 -0700)]
osd: do not block when updating osdmap superblock features
We are holding osd_lock in check_osdmap_features, which means we cannot
block while waiting for filestore operations to flush/apply without
risking deadlock.
The important constraint is that we commit that the feature is enabled
before also committing anything that utilizes sharded objects. The normal
commit sequencing does that already; there is no reason to block here.
Fixes: #8045
Signed-off-by: Sage Weil <sage@inktank.com>
pipe: only read AuthSessionHandler under pipe_lock
session_security, the AuthSessionHandler for a Pipe, is deleted and
recreated while the pipe_lock is held. read_message() is called
without pipe_lock held, and examines session_security. To make this
safe, make session_security a shared_ptr and take a reference to it
while the pipe_lock is still held, and use that shared_ptr in
read_message().
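A rough sketch of the pattern, with hypothetical `Connection` and `SessionHandler` types standing in for Pipe and AuthSessionHandler: the reader copies the shared_ptr while the lock is held and then works with its own reference, so a concurrent delete-and-recreate cannot free the object underneath it.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the per-connection session handler.
struct SessionHandler {
  explicit SessionHandler(int i) : id(i) {}
  int id;
};

class Connection {
 public:
  // Replace the handler while holding the lock.
  void reset_session(int id) {
    std::lock_guard<std::mutex> l(lock_);
    session_ = std::make_shared<SessionHandler>(id);
  }

  // Take a reference under the lock; the caller may use it afterwards
  // without the lock, because the reference keeps the object alive.
  std::shared_ptr<SessionHandler> get_session() {
    std::lock_guard<std::mutex> l(lock_);
    return session_;
  }

 private:
  std::mutex lock_;
  std::shared_ptr<SessionHandler> session_;
};
```

A reader that obtained its reference before a reset keeps seeing the old handler until it drops the reference, which is the property the unlocked read path relies on.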
Sage Weil [Tue, 8 Apr 2014 19:26:19 +0000 (12:26 -0700)]
osd/PG: set CREATING pg state bit until we peer for the first time
We send PG state updates to the monitor while creating a PG before the
actual creation has been finalized and persisted. Because those updates
do not include the CREATING bit, the mon will remove the pgid from its
creating set. If the OSD(s) crash before persisting that PG creation, the
PG will never get created.
Fix this by leaving the CREATING bit set on the primary as long as
last_epoch_started==0. That is, until we successfully peer for the very
first time. Only then do we clear the bit and tell the monitor its duty
is complete.
Fixes: #8001
Signed-off-by: Sage Weil <sage@inktank.com>
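The rule above amounts to a small state computation. The sketch below uses hypothetical flag constants and a hypothetical `reported_state()` helper, not the real PG state bits: the CREATING bit stays set for as long as last_epoch_started is 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values, for illustration only.
constexpr uint32_t STATE_CREATING = 1u << 0;
constexpr uint32_t STATE_ACTIVE = 1u << 1;

// State the primary would report: CREATING is forced on until the PG has
// peered for the first time (last_epoch_started != 0), then cleared.
uint32_t reported_state(uint32_t state, uint32_t last_epoch_started) {
  if (last_epoch_started == 0) {
    return state | STATE_CREATING;
  }
  return state & ~STATE_CREATING;
}
```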
Sage Weil [Tue, 8 Apr 2014 17:52:43 +0000 (10:52 -0700)]
os/FileStore: reset journal state on umount
We observed a sequence like:
- replay journal
- sets JournalingObjectStore applied_op_seq
- umount
- mount
- initiate commit with previous applied_op_seq
- replay journal
- commit finishes
- on replay commit, we fail assert op > committed_seq
Although strictly speaking the assert failure is harmless here, in general
we should not let state leak through from a previous mount into this
mount or else assertions are in general more difficult to reason about.
Fixes: #8019
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 8 Apr 2014 17:58:53 +0000 (10:58 -0700)]
vstart.sh: make crush location match up with what init-ceph does
This makes it so that ./init-ceph restart osd.0 won't modify the CRUSH
tree. And in any case, the localhost/localrack thing we were doing before
was pretty useless.
Sage Weil [Tue, 8 Apr 2014 16:01:14 +0000 (09:01 -0700)]
osd: drop unused same_for_*() helpers
These were all identical and mostly served to obscure the actual logic,
which is now captured by can_discard_op() and the matching Objecter
code on the client side.
Sage Weil [Tue, 8 Apr 2014 16:00:11 +0000 (09:00 -0700)]
osd: drop previous interval ops even if primary happens to be the same
If we have two consecutive intervals with the same primary, the client
will not resend the op and the same_primary_since epoch will not change,
and all is well.
If, however, we have 3 intervals, and the primary changes away and then
back to a particular OSD, the OSD will currently still process the old
request (assuming the timing works out) because it is currently the
primary. This is unnecessary because the client will resend the request.
It may even introduce a hard-to-hit ordering problem since whether or not
the OSD processes the message becomes dependent on how many subsequent
maps it has consumed when the request is processed.
Instead, simplify the minor tangle of helpers by making a single simple
check that discards requests from before same_primary_since. We can then
avoid using the same_for_*() helpers and drop the check from
handle_misdirected_op(), which is also nice because the name is now accurate
(it *only* deals with ops that are in fact misdirected, not just slow to
arrive).
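The single check described above reduces to one epoch comparison. In this sketch, `is_from_previous_interval()` and its parameters are hypothetical names, and it assumes each op records the map epoch at which the client sent it.

```cpp
#include <cassert>
#include <cstdint>

// An op sent before the current primary interval began can be dropped:
// the client will resend it after seeing the newer map.
bool is_from_previous_interval(uint32_t op_sent_epoch,
                               uint32_t same_primary_since) {
  return op_sent_epoch < same_primary_since;
}
```

For the three-interval case, suppose the OSD was primary during epochs 10 to 19, lost the role during 20 to 29, and regained it at epoch 30. An op sent at epoch 12 is discarded even though the OSD is primary again, while an op sent at epoch 30 or later is processed.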
The main change is to use shared_ptr instead of weak_ptr to define the
active request map. The reason is that a slave request needs to be
preserved until the master explicitly finishes it.
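The difference in ownership can be sketched with a hypothetical `ActiveRequests` map (not the MDS code): because the map holds shared_ptr values, a request stays alive after every outside holder drops its reference, and is released only when it is explicitly finished.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical request record.
struct Request {
  explicit Request(int i) : id(i) {}
  int id;
};

class ActiveRequests {
 public:
  // The map itself owns a reference to every active request.
  std::shared_ptr<Request> start(int id) {
    std::shared_ptr<Request> r = std::make_shared<Request>(id);
    active_[id] = r;
    return r;
  }

  std::shared_ptr<Request> find(int id) const {
    std::map<int, std::shared_ptr<Request> >::const_iterator it =
        active_.find(id);
    return it == active_.end() ? std::shared_ptr<Request>() : it->second;
  }

  // Only an explicit finish drops the map's reference.
  void finish(int id) { active_.erase(id); }

 private:
  std::map<int, std::shared_ptr<Request> > active_;
};
```

With a weak_ptr map, the entry would expire as soon as the last outside reference went away, which is exactly the lifetime the commit message says is too short for slave requests.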
erasure-code: thread-safe initialization of gf-complete
Instead of relying on an implicit initialization happening during
encoding/decoding with galois.c:galois_init_default_field, call
gf.c:gf_init_easy for each w value when the plugin is loaded.
Loading the plugin is protected against race conditions by a lock.
It does not cover all possible uses of gf-complete but it is enough for
the ceph jerasure plugin.
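The eager, lock-protected initialization can be sketched like this. `GaloisFields`, `init_field()`, and the list of word sizes are hypothetical placeholders; the real plugin calls into gf-complete, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <mutex>
#include <set>

class GaloisFields {
 public:
  // Initialize every supported word size up front, under a lock, so that
  // later encode/decode calls never trigger lazy initialization.
  void load_plugin() {
    std::lock_guard<std::mutex> guard(lock_);
    if (loaded_) return;  // a second load is a no-op
    const int word_sizes[] = {4, 8, 16, 32};  // assumed set of w values
    for (int w : word_sizes) {
      init_field(w);
    }
    loaded_ = true;
  }

  bool is_initialized(int w) const {
    std::lock_guard<std::mutex> guard(lock_);
    return initialized_.count(w) > 0;
  }

  int init_calls() const {
    std::lock_guard<std::mutex> guard(lock_);
    return init_calls_;
  }

 private:
  // Placeholder for the per-w library initialization call.
  void init_field(int w) {
    initialized_.insert(w);
    ++init_calls_;
  }

  mutable std::mutex lock_;
  std::set<int> initialized_;
  int init_calls_ = 0;
  bool loaded_ = false;
};
```

Doing the work at load time trades a small startup cost for not having to reason about races in the encode and decode paths.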