Samuel Just [Mon, 27 Jul 2015 20:12:25 +0000 (13:12 -0700)]
ReplicatedPG: enforce write ordering on rollback
Previously, rollback ops could reorder with respect to other writes
because they waited on degraded snaps other than head. To fix that, we
introduce a new map tracking objects blocked on degraded snaps. A
particular object can only be blocked on one snap at a time (subsequent
writes won't get far enough to add another entry).
It might have been possible to use the blocked_by machinery for this,
but it requires that the object have an extant obc, which we may not
have for a missing object. Also, that machinery exists primarily to
support clone_range, which I hope to remove soon.
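A minimal sketch of the tracking map described above, using stand-in
types for Ceph's hobject_t and snapid_t (the actual member name and
plumbing in ReplicatedPG may differ):

    #include <map>
    #include <string>

    // Stand-ins for hobject_t/snapid_t; illustrative only.
    using ObjectId = std::string;
    using SnapId = unsigned;

    // Each head object maps to the single degraded snap blocking it.
    // Subsequent writes queue behind the blocked op and never get far
    // enough to insert a second entry, so one snap per object suffices.
    std::map<ObjectId, SnapId> objects_blocked_on_degraded_snap;

    void block_on_degraded_snap(const ObjectId& head, SnapId snap) {
      objects_blocked_on_degraded_snap[head] = snap;
    }

    void on_snap_recovered(const ObjectId& head) {
      objects_blocked_on_degraded_snap.erase(head);
      // ... requeue the ops waiting on this object (elided) ...
    }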
Zhiqiang Wang [Wed, 10 Jun 2015 06:21:36 +0000 (14:21 +0800)]
osd: copy the reqids even if the object is deleted during promotion
If the object is deleted on the base tier and the reqids are not copied
during promotion, we again hit the 'ops not idempotent' problem. This
fix makes the copy-get op copy the reqids even if the object doesn't
exist.
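A hedged sketch of the idea with stand-in types (the real code uses
osd_reqid_t and the copy-get reply structures):

    #include <utility>
    #include <vector>

    struct ReqId { unsigned long tid = 0; };          // stand-in
    using ReqIdList = std::vector<std::pair<ReqId, unsigned>>;

    struct CopyGetReply {
      bool object_exists = false;
      ReqIdList reqids;  // completed client ops, used for dup detection
    };

    // Copy the reqids unconditionally: even when the object was deleted
    // on the base tier, the promoted (whiteout) object must still be
    // able to recognize retried client ops as dups.
    CopyGetReply handle_copy_get(bool exists, const ReqIdList& log_reqids) {
      CopyGetReply reply;
      reply.object_exists = exists;
      reply.reqids = log_reqids;   // not gated on 'exists'
      return reply;
    }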
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Zhiqiang Wang [Tue, 2 Jun 2015 08:36:56 +0000 (16:36 +0800)]
osd: purge the object from the cache when proxying and not promoting the op
When proxying a write/cache op, if we decide not to promote the object,
we need to purge it from the object_contexts cache. Otherwise, the stale
context causes problems for later ops on this object.
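In outline (a sketch; the real object_contexts cache is a SharedLRU
keyed by hobject_t, simplified to a plain map here):

    #include <map>
    #include <memory>
    #include <string>

    struct ObjectContext { /* cached per-object state */ };
    using ObjectId = std::string;

    std::map<ObjectId, std::shared_ptr<ObjectContext>> object_contexts;

    // After proxying a write without promoting, drop the cached context
    // so later ops re-read authoritative state instead of acting on a
    // context that no longer matches the base tier.
    void after_proxied_write_no_promote(const ObjectId& soid) {
      object_contexts.erase(soid);
    }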
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Zhiqiang Wang [Tue, 2 Jun 2015 08:20:35 +0000 (16:20 +0800)]
osd: set the blocked_by relationship when rolling back to a degraded
object
In a scenario like the one below:
- A rollback op comes in and is enqueued.
- Several other ops on the same object come in and are enqueued.
- The rollback op dispatches and finds that the object it rolls back to
is degraded, so the op is pushed onto a list to wait for the degraded
object to recover.
- The later ops are handled and replied to the client.
- The degraded object recovers. The rollback op is enqueued again and
finally replied to the client.
This breaks op ordering. We need to set the blocked_by relationship so
that the later ops are queued until the degraded object recovers.
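A minimal sketch of the queueing this implies, with stand-in types
(Ceph's actual mechanism hangs the waiters off the blocked obc):

    #include <deque>
    #include <map>
    #include <string>

    using ObjectId = std::string;
    struct Op {};  // stand-in for OpRequest

    std::map<ObjectId, bool> blocked;                     // per object
    std::map<ObjectId, std::deque<Op>> waiting_for_blocked_object;

    void handle_op(const ObjectId& oid, const Op& op) {
      if (blocked[oid]) {
        // Queue behind the waiting rollback instead of racing past it,
        // preserving the client's op order.
        waiting_for_blocked_object[oid].push_back(op);
        return;
      }
      // ... service the op ...
    }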
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Zhiqiang Wang [Wed, 27 May 2015 06:02:33 +0000 (14:02 +0800)]
osd: explicitly set the reqid when proxying the write op
This is needed in scenarios like the following:
- Client sends 3 writes and a read on the same object to the base tier.
- Cache tiering is set up.
- Client retries the ops and sends the 3 writes and 1 read to the cache
tier.
- The 3 writes finish on the base tier, say with versions v1, v2 and v3.
- The cache tier proxies the 1st write and starts to promote the object
for the 2nd write; the 2nd and 3rd writes and the read are blocked.
- The proxied 1st write finishes on the base tier with version v4 and
returns to the cache tier, but the cache tier fails to send the reply
due to socket failure injection.
- Client retries the writes and the read again; the writes are
identified as dup ops.
- The promotion finishes. It copies the pg_log entries from the base
tier and puts them in the cache tier's pg_log. This includes the 3
writes on the base tier and the proxied write.
- The writes dispatch after the promotion and are identified as
completed dup ops. The cache tier replies to these writes with the
versions from the base tier (v1, v2 and v3).
- Finally, the read dispatches; it reads the version of the proxied
write (v4) and replies to the client.
- Client complains that 'racing read got wrong version'.
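The fix, in outline: tag the proxied write with the original client
reqid so the base tier's pg_log entry matches the client's retry. A
hedged sketch with stand-in types:

    #include <string>

    struct ReqId { std::string client; unsigned long tid = 0; };
    struct OsdOp { ReqId reqid; /* ... payload ... */ };  // stand-in

    // Without this, the proxied op carries a fresh reqid, the base tier
    // logs it under that reqid, and after promotion the client's retry
    // of the same write is not recognized as a dup.
    OsdOp make_proxied_write(const OsdOp& client_op) {
      OsdOp proxied = client_op;
      proxied.reqid = client_op.reqid;  // explicitly preserve the reqid
      return proxied;
    }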
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Zhiqiang Wang [Thu, 18 Dec 2014 05:31:04 +0000 (13:31 +0800)]
osd/ReplicatedPG: promote on 2nd write
If min_write_recency_for_promote is
- 0: promote when there is a write.
- 1: check if the object is in the current hit set; promote if yes.
- else: check if the object is in the current and the other in-memory
hit sets; promote if yes.
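A sketch of that decision (illustrative; the real check lives in the
promotion path of ReplicatedPG and uses Ceph's HitSet type):

    #include <set>
    #include <string>
    #include <vector>

    using ObjectId = std::string;
    using HitSet = std::set<ObjectId>;   // stand-in for Ceph's HitSet

    bool should_promote_on_write(const ObjectId& oid,
                                 unsigned min_write_recency_for_promote,
                                 const HitSet& current,
                                 const std::vector<HitSet>& older_in_memory) {
      if (min_write_recency_for_promote == 0)
        return true;                     // promote on any write
      if (!current.count(oid))
        return false;                    // must at least be in current
      if (min_write_recency_for_promote == 1)
        return true;                     // current hit set is enough
      unsigned hits = 1;                 // counts the current hit set
      for (const auto& hs : older_in_memory)
        if (hs.count(oid) && ++hits >= min_write_recency_for_promote)
          return true;
      return false;
    }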
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Conflicts:
src/osd/ReplicatedPG.cc
Loic Dachary [Tue, 18 Aug 2015 12:43:15 +0000 (14:43 +0200)]
ceph-disk: only call restorecon when available
9db80da12803d42bb676d67f37442c0c54d83448 added an unconditional call to
restorecon after mounting the filesystem. That call fails when
restorecon is not available, so it must be made conditional.
xinxin shu [Thu, 13 Aug 2015 03:57:58 +0000 (11:57 +0800)]
fix print error of rados bench
Sample output demonstrating the bug (note the garbage IOPS values):

    Total time run:         12.279167
    Total writes made:      92
    Write size:             4194304
    Bandwidth (MB/sec):     30
    Stddev Bandwidth:       23.4
    Max bandwidth (MB/sec): 64
    Min bandwidth (MB/sec): 2
    Average IOPS:           7
    Stddev IOPS:            6
    Max IOPS:               32767
    Min IOPS:               -1537890352
    Average Latency:        2.12
    Stddev Latency:         1.35
    Max latency:            6.05
    Min latency:            0.501
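The garbage values are characteristic of printing min/max trackers that
were never initialized; a hypothetical reconstruction of the defensive
fix (not the actual patch):

    #include <algorithm>
    #include <climits>
    #include <cstdio>

    struct IopsStats {
      int max_iops = 0;        // initialized, never indeterminate
      int min_iops = INT_MAX;
      int samples = 0;

      void sample(int iops) {
        max_iops = std::max(max_iops, iops);
        min_iops = std::min(min_iops, iops);
        ++samples;
      }
      void print() const {
        // Print a defined value even when no samples were taken.
        std::printf("Max IOPS: %d\n", samples ? max_iops : 0);
        std::printf("Min IOPS: %d\n", samples ? min_iops : 0);
      }
    };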
Sage Weil [Thu, 13 Aug 2015 18:49:40 +0000 (14:49 -0400)]
test/encoding/check-generated: test sorted json dumps for nondeterministic objects
Nondeterministic objects dump nondeterministically (usually due
to unordered_map or _set). Compare their sorted output. This
is a weaker test but is better than nothing.
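The check-generated script itself is shell; the comparison it performs
amounts to something like this sketch:

    #include <algorithm>
    #include <sstream>
    #include <string>
    #include <vector>

    // Sort the lines of two JSON dumps before comparing, so differences
    // caused only by unordered_map/_set iteration order are ignored.
    static std::vector<std::string> sorted_lines(const std::string& s) {
      std::vector<std::string> lines;
      std::istringstream in(s);
      for (std::string line; std::getline(in, line); )
        lines.push_back(line);
      std::sort(lines.begin(), lines.end());
      return lines;
    }

    bool dumps_equivalent(const std::string& a, const std::string& b) {
      return sorted_lines(a) == sorted_lines(b);
    }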
We may create a CephContext without calling common_init_finish(), then
delete the CephContext. In this case, ceph::crypto::init() is not
called, so CephContext::~CephContext() should not call
ceph::crypto::shutdown().
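A minimal sketch of the guard, with stand-ins for the crypto calls
(assumed shape, not the exact Ceph code):

    #include <iostream>

    namespace crypto {                     // stand-in for ceph::crypto
      void init()     { std::cout << "crypto init\n"; }
      void shutdown() { std::cout << "crypto shutdown\n"; }
    }

    struct Context {                       // stand-in for CephContext
      bool crypto_inited = false;
      void init_finish() {                 // common_init_finish() role
        crypto::init();
        crypto_inited = true;
      }
      ~Context() {
        if (crypto_inited)                 // only shut down what was
          crypto::shutdown();              // actually initialized
      }
    };

    int main() {
      { Context early; }                        // deleted before init:
                                                // no shutdown call
      { Context normal; normal.init_finish(); } // shutdown runs
    }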
Fixes: #12598
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
Nathan Cutler [Thu, 13 Aug 2015 13:36:02 +0000 (15:36 +0200)]
ceph.spec.in: test %preun argument is zero for removal-only operations
The %preun section now contains logic for disabling and stopping all the
Ceph systemd units when the ceph package is removed. However, there is no
conditional around it, so the units are disabled and stopped on RPM upgrade
as well as removal.
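In an RPM %preun scriptlet, $1 is 0 on removal and 1 (or more) on
upgrade, so the standard guard looks roughly like this (the exact unit
names in the real spec file may differ):

    %preun
    if [ $1 -eq 0 ] ; then
      # zero instances remain: package is being removed, not upgraded
      systemctl --no-reload disable ceph.target > /dev/null 2>&1 || :
      systemctl stop ceph.target > /dev/null 2>&1 || :
    fi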
Loic Dachary [Wed, 12 Aug 2015 12:59:01 +0000 (14:59 +0200)]
tests: be more generous with mon tests timeouts
Change the timeouts of the TEST_mon_add_to_single_mon tests to 120
seconds. Their value is to guard against blocking forever; the exact
timing does not matter as long as the operation completes.
When a timeout is too short, it creates false negatives when running on
slow machines. It could be argued that overly generous timeouts may hide
problems in general, but not in this specific case.
Kefu Chai [Thu, 6 Aug 2015 14:32:42 +0000 (22:32 +0800)]
doc/rados/operations/add-or-rm-mons: simplify the steps to add a mon
This change removes the step of running "ceph mon add" before starting
a new monitor: the existing leader starts an election upon seeing the
MMonJoin message sent by the new joiner, and after the quorum is
achieved the monmap is updated with the new monitor. So "ceph mon add"
is not necessary to add a new monitor.
Moreover, this command blocks until a new quorum is formed and the
proposed monmap is accepted. But in the case of adding a monitor to a
single-monitor cluster, the leader waits until at least two of the
monitors reply to it, which cannot happen unless the new monitor starts.
So from the user's point of view, the command hangs until it times out
if mon.b is not started beforehand, even though this is expected
behaviour.
To avoid this confusion and simplify the steps to add a new monitor,
we'd better simply remove the "ceph mon add" step.
Jason Dillaman [Tue, 11 Aug 2015 13:26:33 +0000 (09:26 -0400)]
librbd: prevent race condition between resize requests
It was possible for the same resize request to be sent twice if a
completed resize op started a newly created resize op while that op was
also being started concurrently by another thread.
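A hedged sketch of the kind of guard involved, with hypothetical names
(the actual fix is in librbd's async resize state machine):

    #include <mutex>

    class ResizeGuard {
      std::mutex lock;
      bool op_in_progress = false;
    public:
      // Exactly one caller wins the right to send the resize request;
      // the loser must not (re)send it.
      bool try_start() {
        std::lock_guard<std::mutex> l(lock);
        if (op_in_progress)
          return false;
        op_in_progress = true;
        return true;
      }
      void finish() {
        std::lock_guard<std::mutex> l(lock);
        op_in_progress = false;
        // the next queued resize would be started here, under the lock
      }
    };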
Fixes: #12664
Backport: hammer
Signed-off-by: Jason Dillaman <dillaman@redhat.com>