Josh Durgin [Thu, 30 Aug 2012 00:30:17 +0000 (17:30 -0700)]
doc: clarify rbd man page (esp. layering)
* a clone's size can't be overridden
* note which commands require format 2
* clarify details of copy
* add examples for cloning
* add pool to map example for consistency
* fix a couple warnings and re-sync man page with rst
Josh Durgin [Wed, 29 Aug 2012 00:24:47 +0000 (17:24 -0700)]
librbd: prevent racing clone and snap unprotect
If the following sequence of events occurred,
a clone could be created of an unprotected snapshot:
1. A: begin clone - check that snap foo is protected
2. B: rbd unprotect snap foo
3. B: check that all pools have no clones of foo
4. B: unprotect snap foo
5. A: finish creating clone of foo, add it as a child
To stop this from happening, check at the beginning and end of
cloning that the parent snapshot is protected. If it is not,
or checking protection status fails (possibly because the parent
snapshot was removed), remove the clone and return an error.
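The check-at-both-ends approach described above can be sketched as a small Python model. This is not librbd code: `Snapshot`, `clone`, and the `between_checks` hook are hypothetical names, and the hook exists only to reproduce the race deterministically.

```python
class Snapshot:
    """Toy stand-in for a parent snapshot (hypothetical, not a librbd type)."""

    def __init__(self, name):
        self.name = name
        self.protected = True
        self.children = []


def clone(snap, child_name, between_checks=None):
    """Create a child of snap, checking protection before and after."""
    # First check: refuse to start cloning an unprotected snapshot.
    if not snap.protected:
        raise ValueError("parent snapshot is not protected")

    snap.children.append(child_name)

    # Test hook standing in for another client acting while we clone.
    if between_checks is not None:
        between_checks()

    # Second check: if protection was dropped meanwhile, undo the clone.
    if not snap.protected:
        snap.children.remove(child_name)
        raise ValueError("parent snapshot was unprotected during clone")
    return child_name
```

Passing a hook that clears `snap.protected` plays the role of client B in the sequence above; the clone is then rolled back and an error is raised.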
Samuel Just [Tue, 11 Sep 2012 18:05:40 +0000 (11:05 -0700)]
ReplicatedPG: do not start_recovery_op if we are already pushing
Should fix bug #2761.
If we are already pushing soid, recovery_ops will only be decremented once for
all current pushes, so only increment recovery_ops if we are not currently
pushing it.
This bug causes us to leak a recovery op and get stuck in backfill.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
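A minimal Python model of the accounting rule in this commit, assuming a single counter and a per-object record of pushes in flight. `RecoveryTracker` and its methods are hypothetical names, not ReplicatedPG members.

```python
class RecoveryTracker:
    """Toy model of recovery op accounting (hypothetical names)."""

    def __init__(self):
        self.recovery_ops = 0
        self.pushing = {}  # object id -> set of peers being pushed to

    def start_push(self, soid, peer):
        # Count a new recovery op only if soid is not already being pushed.
        if soid not in self.pushing:
            self.recovery_ops += 1
            self.pushing[soid] = set()
        self.pushing[soid].add(peer)

    def finish_pushes(self, soid):
        # All current pushes of soid complete together: one decrement.
        if self.pushing.pop(soid, None) is not None:
            self.recovery_ops -= 1
```

Incrementing on every `start_push` instead would leave the counter above zero after `finish_pushes`, which is the leak described above.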
Sage Weil [Tue, 4 Sep 2012 22:25:20 +0000 (15:25 -0700)]
osd: fill in user log entry last after snapdir tran
Reorder the snapdir logic and ctx->at_version adjustments prior to filling
in the object_info_t and user_versions and all that stuff. Adjust
at_version after appending the log entry (so that it points to the next
position/version we will write at, culminating in the actual user
event).
The user log entry contains the request id, which will be used
by replay ops to put themselves in the correct place in the
waiting_for_commit/ack maps. Thus, the repop needs to be tagged
with the same version as the log entry with the request id.
Thus, the request id bearing log entry should be the last in
the log entry vector.
This should fix #3072, wherein a replay which should wait on
the repop tagged as version '36 will instead wait on '35.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Tue, 28 Aug 2012 22:14:41 +0000 (15:14 -0700)]
osd: fix waiting_for_disk assertion
If requeue is false, we won't have cleared out waiting_for_ondisk; adjust
assert placement as appropriate. Also, make sure we handle the requeue
and !op case properly (although I'm not sure offhand if/when it would
come up).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Mike Ryan [Tue, 28 Aug 2012 18:57:03 +0000 (11:57 -0700)]
rados_bench: wait for completion callbacks before returning
If we don't wait for the callback, the finisher may cleanup the callback
context before the callback is actually invoked, causing a
use-after-free error.
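The idea of waiting for outstanding callbacks before returning can be sketched in Python with a counter guarded by a condition variable. This is an illustration only; `CallbackTracker` is a hypothetical name, and rados_bench itself is C++ built on librados completions.

```python
import threading


class CallbackTracker:
    """Counts callbacks in flight so the caller can wait for all of them."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0

    def submit(self, callback):
        with self._cond:
            self._pending += 1

        def run():
            try:
                callback()
            finally:
                # Mark the callback as finished and wake any waiter.
                with self._cond:
                    self._pending -= 1
                    self._cond.notify_all()

        threading.Thread(target=run).start()

    def wait_all(self, timeout=None):
        # Block until every submitted callback has finished running.
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)
```

Calling `wait_all()` before tearing down any state the callbacks touch avoids the use-after-free pattern described above.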
Sage Weil [Mon, 27 Aug 2012 21:31:32 +0000 (14:31 -0700)]
osd: requeue dup ops inline with in-progress ops
We should requeue the dups along with the originals. This avoids
situations where, after requeue, the dups are reordered with respect to
each other. For example:
- client sends A, B, C
- osd receives A
- connection drops
- client sends A', B', C'
- osd puts A' in waiting_for_ondisk, starts B' and C'
- on_change() requeues everything
Final queue order (before this patch) is
A, B', C', A'
After this patch, the resulting queue order is
A, A', B', C'
Or somewhat more generally, it might be:
A, A', B, B', B'', C', C'', D'', ....
Fixes (another source of): #2947
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
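The queue orders in this commit can be reproduced with a short Python sketch, assuming each in-progress op has a list of duplicates that arrived while it was in flight. `requeue` and its arguments are hypothetical names, not the OSD's data structures.

```python
def requeue(in_progress, dups):
    """Return the requeue order: each in-progress op followed by its dups.

    in_progress: op names in the order they were started.
    dups: maps an op name to the duplicates received while it was in flight.
    """
    order = []
    for op in in_progress:
        order.append(op)
        # Requeue duplicates inline, right behind the original.
        order.extend(dups.get(op, []))
    return order


# The example from the commit message: A is in flight with dup A',
# while B' and C' were started after the reconnect.
print(requeue(["A", "B'", "C'"], {"A": ["A'"]}))
```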
Sage Weil [Fri, 24 Aug 2012 21:43:56 +0000 (14:43 -0700)]
mon: describe how pgs are stuck in 'health detail'
Showing the current state and saying it is stuck doesn't tell you how it
is stuck (e.g. stuck unclean, stuck inactive, etc.). Also include the
stuck duration.
Fixes: #2876
Signed-off-by: Sage Weil <sage@inktank.com>
Mike Ryan [Tue, 24 Jul 2012 03:45:31 +0000 (20:45 -0700)]
obj_bencher: remove all benchmark files matching a prefix
This is a fallback for when a user wishes to delete ALL benchmark files
matching a particular prefix. In the fast case, a metadata file tells us
enough to quickly delete the files in parallel. This is the slow case,
where each file's name must be checked against the prefix.
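A rough Python sketch of the slow path, assuming the object store can be modeled as a dict keyed by object name. `remove_by_prefix` is a hypothetical helper, not part of obj_bencher.

```python
def remove_by_prefix(store, prefix):
    """Delete every object whose name starts with prefix; return the count."""
    # Slow path: every name has to be compared against the prefix.
    doomed = [name for name in store if name.startswith(prefix)]
    for name in doomed:
        del store[name]
    return len(doomed)
```

The names are collected first so the dict is not modified while it is being iterated.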
Sage Weil [Thu, 23 Aug 2012 20:26:32 +0000 (13:26 -0700)]
msg/Pipe: conditionally detect session reset
Lossless peers (osd<->osd, mds<->mds, mon<->mon) never reset sessions
to each other. In the osd and mds cases, there is no need to check for
session resets. More significantly, these checks can trigger with an
unfortunate sequence of socket failures. In particular,
- A sends connect request to B
- B accepts, increments connect_seq, then has a socket failure
before telling A
- A reconnects, still with connect_seq == 0
- B sees connect_seq == 0 and thinks there was a reset
This warrants a closer look in the fs client <-> mds case, but for now,
in the cluster-internal communications, it is moot, since reset
detection is unnecessary.
In the monitor case: we do need to check for resets because the peers
reuse the same entity_addr_t's (nonce==0), which means that a daemon
restart is effectively a reset. In that case, use a different policy
that continues to check for resets.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
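A simplified Python model of making reset detection depend on the connection policy. The `Policy` fields and `is_session_reset` are hypothetical names chosen for illustration; the real logic lives in the C++ messenger and handles many more cases.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Per-peer-type connection policy (hypothetical fields)."""
    lossy: bool
    resetcheck: bool


def is_session_reset(policy, peer_connect_seq, local_connect_seq):
    """Decide whether an incoming connect looks like a session reset."""
    # Lossless cluster peers (osd<->osd, mds<->mds) skip the check entirely.
    if not policy.resetcheck:
        return False
    # The peer starts over at 0 while we still hold state for the session.
    return peer_connect_seq == 0 and local_connect_seq > 0


cluster_peer = Policy(lossy=False, resetcheck=False)
monitor_peer = Policy(lossy=False, resetcheck=True)
```

With `cluster_peer`, the socket failure sequence listed above no longer looks like a reset, while `monitor_peer` keeps the check for daemon restarts.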
Sage Weil [Thu, 23 Aug 2012 20:27:26 +0000 (13:27 -0700)]
osd: prefer acting osds in calc_acting()
We currently prefer up osds, and then pull sequentially from peer_info
(strays we know about at the time). This adds an additional preference
for the current acting, which means we can avoid changes to acting when
they are largely useless.
In particular, I observed that we chose [5,3] and later (when recovery
completed) chose [5,1] because we had since heard about an eligible stray
on 1. That switch was basically a waste...
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
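The preference order can be illustrated with a heavily simplified Python sketch. It assumes eligibility is already known per OSD and ignores everything else the real calc_acting() weighs; `choose_acting` and its parameters are hypothetical.

```python
def choose_acting(up, acting, strays, eligible, size):
    """Pick up to size OSDs, preferring up, then current acting, then strays."""
    chosen = []
    for osd in list(up) + list(acting) + list(strays):
        if len(chosen) == size:
            break
        if osd in eligible and osd not in chosen:
            chosen.append(osd)
    return chosen


# With the acting preference, a newly discovered stray on osd 1 does not
# displace osd 3 from the acting set.
print(choose_acting(up=[5], acting=[5, 3], strays=[1, 3], eligible={1, 3, 5}, size=2))
```

Passing an empty `acting` list approximates the earlier behavior, where the stray on osd 1 would be picked ahead of osd 3.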
Dan Mick [Mon, 20 Aug 2012 22:02:57 +0000 (15:02 -0700)]
rbd: force all exiting paths through main()/return
This properly destroys objects. In the process, remove usage_exit();
also kill error-handling in set_conf_param (never relevant for rbd.cc,
and if you call it with both pointers NULL, well...)
Also switch to EXIT_FAILURE for consistency.
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Fixes: #2948
Sage Weil [Wed, 22 Aug 2012 04:12:33 +0000 (21:12 -0700)]
objecter: use ordered map<> for tracking tids to preserve order on resend
We are using a hash_map<> to map tids to Op*'s. In handle_osd_map(),
we will recalc_op_target() on each Op in a random (hash) order. These
will get put in a temp map<tid,Op*> to ensure they are resent in the
correct order, but their order on the session->ops list will be random.
Then later, if we reset an OSD connection, we will resend everything for
that session in ops order, which is incorrect.
Fix this by explicitly reordering the requests to resend in
kick_requests(), much like we do in handle_osd_map(). This lets us
continue to use a hash_map<>, which is faster for reasonable numbers of
requests. A simpler but slower fix would be to just use map<> instead.
This is one of many bugs contributing to #2947.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
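The reordering step can be sketched in Python, assuming each op on a session carries its tid. This `kick_requests` is a hypothetical stand-in for illustration, not the Objecter method of the same name.

```python
def kick_requests(session_ops):
    """Return ops in tid order, whatever order the session list holds them in.

    session_ops: iterable of (tid, op) pairs in arbitrary order.
    """
    # Collect into a temporary mapping, then walk it in sorted tid order,
    # mirroring the temp map<tid,Op*> described above.
    resend = {tid: op for tid, op in session_ops}
    return [resend[tid] for tid in sorted(resend)]


print(kick_requests([(3, "write c"), (1, "write a"), (2, "write b")]))
```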
Sage Weil [Tue, 21 Aug 2012 00:04:58 +0000 (17:04 -0700)]
mon: fix monitor cluster contraction race
If we contract to 1 monitor, we win_standalone_election() without bumping
the election epoch. Racing paxos updates can then reach us without being
ignored and trigger an assert:
mon/Paxos.cc: In function 'void Paxos::handle_accept(MMonPaxos*)' thread 7f85eae05700 time 2012-08-20 16:01:00.843937
mon/Paxos.cc: 468: FAILED assert(state == STATE_UPDATING)
Fixes: #3003
Reported-by: John Wilkins <john.wilkins@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Tommi Virtanen [Tue, 21 Aug 2012 00:06:09 +0000 (17:06 -0700)]
mkcephfs, init-ceph: Warn if hostname "localhost" is seen in ceph.conf.
Given a ceph.conf that looks like
[osd.42]
host = localhost
mkcephfs used to exit with an obscure error message:
cat: /tmp/mkcephfs.MCBIHvn4Ru/key.*: No such file or directory
"localhost" was never intended to be a valid hostname to use there.
Warn if we see it, and skip the entry. You should use the proper short
hostname of the box.
As init-ceph and mkcephfs share this library, this change affects the
sysvinit scripts too. The behavior *shouldn't* change there (localhost
entries were ignored earlier, too), but you may see this extra
warning. Which is good.
Closes: #3001
Signed-off-by: Tommi Virtanen <tv@inktank.com>
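The warn-and-skip behavior can be modeled in Python with the standard configparser module. The real scripts are shell; `usable_daemons` is a hypothetical helper, and the warning text is made up for this sketch.

```python
import configparser


def usable_daemons(conf_text):
    """Return (sections to keep, warnings) for a ceph.conf-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    kept, warnings = [], []
    for section in parser.sections():
        host = parser.get(section, "host", fallback=None)
        if host == "localhost":
            # Warn and skip the entry instead of failing later on.
            warnings.append(
                f"{section}: 'localhost' is not a valid hostname, skipping"
            )
            continue
        kept.append(section)
    return kept, warnings


sample = "[osd.42]\nhost = localhost\n[osd.43]\nhost = node1\n"
kept, warnings = usable_daemons(sample)
```

Here `kept` holds only `osd.43`, and `warnings` has one entry for `osd.42`.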
Sage Weil [Mon, 20 Aug 2012 19:33:08 +0000 (12:33 -0700)]
osd: fix requeue order of dup ops
The waiting_for_ondisk (and ack) maps get dups of ops that are in progress.
If we have a peering change in which the role does not change, we will
requeue the in-progress ops but leave these in the waiting_for_ondisk
maps, which will then trigger an assert the next time we examine that map
and find it didn't match up with what we expected.
Fix this by requeuing these on any peering reset in on_change(). This
keeps the two queues in sync.
Fixes: #2956
Signed-off-by: Sage Weil <sage@inktank.com>
Travis Rhoden [Mon, 20 Aug 2012 20:29:11 +0000 (13:29 -0700)]
init-ceph: use SSH in "service ceph status -a" to get version
When running "service ceph status -a", a version number was never
returned for remote hosts, only for the local. This was because
the command to query the version number didn't use the do_cmd
function, which is responsible for running the command over SSH
when needed.
Modify the ceph init.d script to use do_cmd for querying the
Ceph version.
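The role of a do_cmd-style wrapper can be illustrated in Python. `build_cmd` is a hypothetical helper that mirrors the idea only; the actual function is shell code in the init script, and the command string is just an example.

```python
def build_cmd(cmd, host, local_host):
    """Return argv that runs cmd locally or over SSH, depending on host."""
    if host == local_host:
        # Local daemon: run the command through the shell directly.
        return ["sh", "-c", cmd]
    # Remote daemon: let ssh run the same command on the other host.
    return ["ssh", host, cmd]


print(build_cmd("ceph -v", host="node2", local_host="node1"))
```

Routing the version query through the same wrapper as every other command is what lets remote hosts report a version too.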
Travis Rhoden [Fri, 17 Aug 2012 20:45:09 +0000 (16:45 -0400)]
doc: mkcephfs man page, -c ceph.conf is not optional
The man page for mkcephfs and the output of mkcephfs --help
do not agree with each other. The man page says -c ceph.conf
is optional, while mkcephfs --help says it is required.
Through empirical evidence, I believe it is required. Update
the man page to make it so.
Sage Weil [Fri, 17 Aug 2012 16:02:10 +0000 (09:02 -0700)]
mds: do not return null dentry lease on getattr
Specifically, /foo may exist and client may try to mount /foo/bar. That
GETATTR request is on #1/foo/bar, but we cannot return a null dentry on bar
because the client is not prepared to handle it and will crash in
fill_trace().
Fixes: #2959
Reported-by: Yan Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>