Sage Weil [Sun, 1 Jul 2012 22:37:31 +0000 (15:37 -0700)]
msgr: choose incoming connection if ours is STANDBY
If the connect_seq matches, but our existing connection is in STANDBY, take
the incoming one. Otherwise, the other end will wait indefinitely for us
to connect but we won't.
Alternatively, we could "win" the race and trigger a connection by sending
a keepalive (or similar), but that is more work; we may as well accept the
incoming connection we have now.
This removes STANDBY from the acceptable WAIT case states. It also keeps
responsibility squarely on the shoulders of the peer with something to
deliver.
Without this patch, a 3-osd vstart cluster with
'ms inject socket failures = 100' and rados bench write -b 4096 would start
generating slow request warnings after a few minutes due to the osds
failing to connect to each other. With the patch, I complete a 10 minute
run without problems.
Sage Weil [Fri, 29 Jun 2012 00:50:47 +0000 (17:50 -0700)]
msgr: preserve incoming message queue when replacing pipes
If we replace an existing pipe with a new one, move the incoming queue
of messages that have not yet been dispatched over to the new Pipe so that
they are not lost. This prevents messages from being lost.
Alternatively, we could set in_seq = existing->in_seq - existing->in_qlen,
but that would make the other end resend those messages, which is a waste
of bandwidth.
Very easy to reproduce the original bug with 'ms inject socket failures'.
Sage Weil [Fri, 29 Jun 2012 00:38:34 +0000 (17:38 -0700)]
msgr: move incoming queue to separate class
This extricates the incoming queue and its funky relationship with
DispatchQueue from Pipe and moves it into IncomingQueue. There is now a
single IncomingQueue attached to each Pipe. DispatchQueue is now no
longer tied to Pipe.
This modularizes the code a bit better (tho that is still a work in
progress) and (more importantly) will make it possible to move the
incoming messages from one pipe to another in accept().
Sage Weil [Thu, 28 Jun 2012 00:06:40 +0000 (17:06 -0700)]
msgr: make D_CONNECT constant non-zero, fix ms_handle_connect() callback
A while ago we inadvertantly broke ms_handle_connect() callbacks because
of a check for m being non-zero in the dispatch_entry() thread. Adjust the
enums so that they get delivered again.
This fixes hangs when, for example, the ceph tool sends a command, gets a
connection reset, and doesn't get the connect callback to resend after
reconnecting to a new monitor.
Sage Weil [Wed, 27 Jun 2012 00:07:31 +0000 (17:07 -0700)]
msgr: do not try to reconnect con with CLOSED pipe
If we have a con with a closed pipe, drop the message. For lossless
sessions, the state will be STANDBY if we should reconnect. For lossy
sessions, we will end up with CLOSED and we *should* drop the message.
Sage Weil [Wed, 27 Jun 2012 00:06:41 +0000 (17:06 -0700)]
msgr: move to STANDBY if we replace during accept and then fail
If we replace an existing pipe during accept() and then fail, move to
STANDBY so that our connection state (connect_seq, etc.) is preserved.
Otherwise, we will throw out that information and falsely trigger a
RESETSESSION on the next connection attempt.
Sage Weil [Mon, 7 May 2012 21:37:38 +0000 (14:37 -0700)]
filestore: set min flush size
If a write is smaller than some threshold, do not bother to flush it; let
the fs do that (efficiently, we hope) at commit time. Focus on the big
writes.
Yehuda Sadeh [Wed, 27 Jun 2012 00:16:11 +0000 (17:16 -0700)]
rest-bench: mark request as complete later
We marked a request as complete in the callback, however
it might be that we're still inside S3_runall_request_context()
which means that request is not really complete yet.
Possibly fixes bug #2652.
If we play 1-4 and then replay 1-4 again, we will end up removing
(b, 1)'s attributes since nlink for (a, 1) the second time through
is 1. We fix this by marking spos on the object_map header for
(a, 1) when we remove (a, 1) but not eh attributes.
- keyrings have new default locations that everyone should use.
- the user key setup is vastly simplified if you use the
'ceph auth get-or-create' command.
mon: MonmapMonitor: Use default port when the specified on 'add' is zero
Fixes a bug triggered by using the ceph tool to 'mon add' with a port set
to zero. We now default to the monitor's default port (6789) instead, and
we will fail if that port is already assigned to some other monitor.
Fixes: bug #2661 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Yehuda Sadeh [Wed, 27 Jun 2012 00:16:11 +0000 (17:16 -0700)]
rest-bench: mark request as complete later
We marked a request as complete in the callback, however
it might be that we're still inside S3_runall_request_context()
which means that request is not really complete yet.
Possibly fixes bug #2652.
Florian Haas [Sat, 23 Jun 2012 09:50:24 +0000 (10:50 +0100)]
radosgw-admin: improve man page
* remove "OpenStack user" information (deprecated, should no
longer be used. No reason to keep mentioning it)
* fix description of --uid
* mention subusers
* add key management commands
Florian Haas [Thu, 14 Jun 2012 21:07:44 +0000 (23:07 +0200)]
doc: explain how to configure Ceph for radosgw
* explain creating auth creds for radosgw
* explain Apache config for radosgw
* explain starting daemons for radosgw
* explain creating users for radosgw
* explain creating subusers for radosgw
* explain using swift client against radosgw
* Explain user:subuser to tenant:user mapping in Swift
Sage Weil [Wed, 20 Jun 2012 18:07:29 +0000 (11:07 -0700)]
objecter: do not feed session to op_submit()
The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly. The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor. It
didn't affect the test results because we weren't testing whether the
watch was working in that case.
Instead, we'll just recalculate and get the same value the parent linger
op did. Which is fine, and goes through all the usual code paths so
nothing is missed.
Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 20 Jun 2012 18:07:29 +0000 (11:07 -0700)]
objecter: do not feed session to op_submit()
The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly. The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor. It
didn't affect the test results because we weren't testing whether the
watch was working in that case.
Instead, we'll just recalculate and get the same value the parent linger
op did. Which is fine, and goes through all the usual code paths so
nothing is missed.
Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 21 Jun 2012 14:31:47 +0000 (07:31 -0700)]
mon: encoding new monmap using quorum feature set
It is probably unlikely that someone will expand the mon cluster with a
mixed feature set, but we know the quorum features here, so we should use
them.
Sage Weil [Thu, 21 Jun 2012 03:41:17 +0000 (20:41 -0700)]
mon: conditionally encode auth incremental with quorum feature bits
If the quorum does not yet all have the MONENC feature, stick to the old
encoding.
It might be more polite to require a super-quorum before switching over,
and take note so that thereafter we can stick to the new encoding, but
that has more moving parts and I'm not sure it's worth the complexity.
Sage Weil [Thu, 21 Jun 2012 03:33:41 +0000 (20:33 -0700)]
mon: track intersection of quorum member features
When we form a quorum, also note the intersection of the quorum members'
feature bits. This will inform decisions about what encodings we use.
This is an imperfect strategy because the quorum may change, and we may
have a mon with old code join in and not understand what is going on.
However, it does ensure that a majority of the members run new code, so in
the absence of other failures we can make progress.
This was an ill-conceived approach to getting atomic transactions out of
btrfs. It doesn't offer rollback, which means that any error means we need
to wedge the file system and reboot in order to avoid corrupting the
data set. And that's silly!
Snapshots are more robust and only marginally slower (because we have to
quiesce our writes while waiting for the snap to start, and btrfs resume
work in-kernel slightly faster...maybe).
Fixes: #2623 Signed-off-by: Sage Weil <sage@inktank.com>