John Spray [Wed, 17 Dec 2014 14:06:53 +0000 (14:06 +0000)]
tools/cephfs: add recover_dentries to journaltool
This is intended as a comparatively safe recovery
operation, where we compare the versions
of journalled dentries with those in the backing store,
and write into the backing store only when the
existing contents are older than the journal
or invalid.
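For illustration, a minimal sketch of that comparison rule (hypothetical types and names, not the actual journaltool code):

```cpp
#include <cstdint>

// Illustrative sketch of the recovery rule: write the journalled dentry
// into the backing store only if the stored copy is older or unreadable.
struct DentryInfo {
    uint64_t version;
    bool valid;
};

bool should_overwrite(const DentryInfo &journalled, const DentryInfo &stored) {
    if (!stored.valid)
        return true;                              // invalid contents: always repair
    return stored.version < journalled.version;  // only replace older versions
}
```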
Fixes: #9883
Signed-off-by: John Spray <john.spray@redhat.com>
Sage Weil [Mon, 19 Jan 2015 00:49:20 +0000 (16:49 -0800)]
mon: handle case where mon_globalid_prealloc > max_global_id
This triggers with the new larger mon_globalid_prealloc value. It didn't
trigger on the existing cluster I tested on because it already had a very
large max.
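In spirit the fix is an arithmetic guard; a hedged sketch with hypothetical names (not the monitor's actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Hedged sketch: when carving out the next preallocation window, never
// let it run past the committed max_global_id.
uint64_t prealloc_end(uint64_t last_allocated,
                      uint64_t prealloc,          // e.g. mon_globalid_prealloc
                      uint64_t max_global_id) {
    // With prealloc > max_global_id the naive sum would overshoot the
    // committed ceiling; clamp instead of handing out unbacked ids.
    return std::min(last_allocated + prealloc, max_global_id);
}
```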
Sage Weil [Sun, 18 Jan 2015 18:39:25 +0000 (10:39 -0800)]
mon: change mon_globalid_prealloc to 10000 (from 100)
100 ids (i.e., 100 session authentications) can be consumed quite quickly if
the monitor is being queried by the CLI via scripts or on a large cluster,
especially if the propose interval is long (many seconds). These live in
a 64-bit value and are only "lost" if we have a mon election before they
are consumed, so there's no real risk here.
Backport: giant, firefly
Reviewed-by: Joao Eduardo Luis <joao@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Yehuda Sadeh [Fri, 16 Jan 2015 01:30:24 +0000 (17:30 -0800)]
rgw: bilog marker related fixes
Fix the way we parse the marker. Instead of specifying whether the
bucket is sharded or not, we pass a shard_id. If the string itself is a
plain single-shard marker, we'll use the passed shard_id; otherwise
we'll parse the string and determine the shard id from it. This way,
when referencing a single shard, we can get the marker with the shard id
either specified or not. This works for the non-sharded case too.
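A hedged sketch of that rule, assuming an illustrative "<shard>#<marker>" string format rather than the exact rgw encoding:

```cpp
#include <string>

// Hedged sketch of the parsing rule, assuming an illustrative
// "<shard>#<marker>" format (not the exact rgw encoding).
void parse_bilog_marker(const std::string &s, int passed_shard_id,
                        int *shard_id, std::string *marker) {
    std::string::size_type pos = s.find('#');
    if (pos == std::string::npos) {
        // Single-shard form: fall back to the shard id the caller passed.
        *shard_id = passed_shard_id;
        *marker = s;
    } else {
        // The string encodes the shard id itself; prefer that.
        *shard_id = std::stoi(s.substr(0, pos));
        *marker = s.substr(pos + 1);
    }
}
```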
Adjust the bilog listing function to work with the new interface. It
was broken before; this includes multiple fixes to it.
mon: mkfs compatset may be different from runtime compatset
When we create a monitor we set a given number of compat features on
disk to clearly state the features a given monitor supports -- mostly to
break backward compatibility when such compatibility cannot be
guaranteed.
However, we may wish to toggle some features during runtime; e.g., wait
for all the monitors in the quorum to support a given feature before
flipping a switch and stating that all monitors now require feature X.
We are already flipping those switches during runtime, but we weren't
allowing the monitor to set a subset of those features during mkfs.
While the initial approach worked fine with clusters being upgraded and
fresh clusters, it could become weird in a mixed-version environment.
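A minimal sketch of the idea, using plain std::set stand-ins rather than Ceph's actual CompatSet type:

```cpp
#include <set>
#include <string>
#include <vector>

// Hedged sketch with std::set stand-ins (not Ceph's CompatSet): the
// feature set persisted at mkfs may be a strict subset of what the
// binary supports; extra features become required only at runtime.
using FeatureSet = std::set<std::string>;

// Flip the switch only once every monitor in quorum supports the feature.
bool can_require(const std::string &feature,
                 const std::vector<FeatureSet> &quorum_supported) {
    for (const auto &supported : quorum_supported)
        if (!supported.count(feature))
            return false;
    return true;
}
```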
Backport: emperor,firefly,giant
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Jianpeng Ma [Fri, 16 Jan 2015 08:14:17 +0000 (16:14 +0800)]
test: Use different filenames for different test cases.
Some test cases use a tmp file for testing, but they used the same
file name and created it in the same directory. When run in parallel,
this caused errors. So each test case now uses its own tmp file.
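A minimal sketch of the pattern (hypothetical helper, not the actual test code):

```cpp
#include <string>
#include <unistd.h>

// Hedged sketch: give every test case its own tmp file instead of one
// shared name, so parallel runs cannot clobber each other.
std::string tmp_file_for(const std::string &test_case) {
    // The pid also keeps concurrent runs of the same binary apart.
    return "/tmp/" + test_case + "." + std::to_string(getpid()) + ".tmp";
}
```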
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
John Spray [Fri, 2 Jan 2015 17:48:25 +0000 (17:48 +0000)]
tools: create cephfs-table-tool
It was unnatural to shoehorn resetting tables
into the journaltool. This new tool initially
can simply dump or reset the session/snap/ino
tables, and would also be a place for any
more complex operations in future.
John Spray [Fri, 16 Jan 2015 00:00:56 +0000 (00:00 +0000)]
mds: abstract SessionMapStore from SessionMap
This is similar to what I did for InodeStore a while back:
introduce a logical separation between the persisted attributes
(and their encoding) and the live/runtime behavioural code. This
results in a handy SessionMapStore class that can be used for
encode/decode from tools.
Also give it a reset_state method so that it matches the
prototype of the MDSTable subclasses for the benefit of
cephfs-table-tool.
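A simplified sketch of the split, with stand-in types rather than the real MDS classes:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Simplified sketch (stand-in types, not the real MDS code): the *Store
// base owns the persisted attributes and their encode/decode so offline
// tools can reuse them; runtime behaviour stays in the subclass.
struct Session { uint64_t state = 0; };

class SessionMapStore {
public:
    virtual ~SessionMapStore() = default;
    void encode(std::string *out) const { /* persisted attributes only */ }
    void decode(const std::string &in)  { /* ...elided... */ }
    // Matches the MDSTable subclasses' prototype, for cephfs-table-tool.
    void reset_state() { version = 0; sessions.clear(); }
protected:
    uint64_t version = 0;
    std::map<std::string, Session> sessions;
};

class SessionMap : public SessionMapStore {
    // live/runtime behaviour (timeouts, I/O against the rank, ...) here
};
```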
Loic Dachary [Tue, 6 Jan 2015 20:55:25 +0000 (21:55 +0100)]
erasure-code: tests use different pool/profile names
Use different erasure coded pool names and profiles to avoid deletion /
creation races. The more expensive alternative is to run a different
cluster for each test.
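A minimal sketch of the naming pattern (hypothetical helper):

```cpp
#include <string>
#include <unistd.h>

// Hedged sketch: derive per-test pool and profile names so concurrent
// tests never create/delete the same objects.
std::string test_pool_name(const std::string &test) {
    return "ecpool-" + test + "-" + std::to_string(getpid());
}
```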
Add a new entry to the PG troubleshooting section that covers the most
common problems reported when an erasure coded pool fails to properly
map PGs to enough OSDs.
Loic Dachary [Thu, 18 Dec 2014 00:25:54 +0000 (01:25 +0100)]
erasure-code: set max_size to chunk_count() instead of 20
The ruleset created for an erasure coded pool has max_size set to a
fixed value of 20, which may be incorrect when more than 20 chunks are
needed and lead to obscure errors. Set it to the number of chunks,
i.e. k+m most of the time.
In a cluster with few OSDs (9 for instance), setting max_size to 20
causes performance problems when injecting a new crushmap. The monitor
will call CrushTester::test which tries 1024 mappings for all sizes
ranging from min_size to max_size. Each attempt to map more OSDs than
available will exhaust all retries (50 by default) and it takes a
significant amount of time. In a cluster with 9 OSDs, testing one such
ruleset can take up to 5 seconds.
Since the test blocks the monitor leader, a few erasure coded rulesets
will block the monitor long enough to exceed the timeouts and trigger an
election.
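A tiny worked example of the new bound (illustrative code, not the actual ruleset generator):

```cpp
#include <iostream>

// Illustrative sketch: the ruleset's max_size should be the erasure
// profile's chunk count (one OSD per chunk), not a hard-coded 20.
int chunk_count(int k, int m) { return k + m; }

int main() {
    int k = 6, m = 3;  // e.g. a hypothetical 6+3 erasure profile
    std::cout << "max_size = " << chunk_count(k, m) << "\n";  // 9, not 20
    return 0;
}
```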
Loic Dachary [Wed, 17 Dec 2014 15:06:55 +0000 (16:06 +0100)]
crush: set_choose_tries = 100 for erasure code rulesets
It is common for people to try to map 9 OSDs out of a ceph cluster with
9 OSDs total. The default number of tries (50) will frequently lead to bad mappings for
this use case. Changing it to 100 makes no significant CPU performance
difference, as tested manually by running crushtool on one million
mappings.
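For reference, a hedged sketch of what such a ruleset looks like in decompiled crushmap syntax (the name and the 6+3 sizing are illustrative):

```
rule ecpool {
	ruleset 1
	type erasure
	min_size 3
	max_size 9
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default
	step chooseleaf indep 0 type host
	step emit
}
```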
Haomai Wang [Wed, 14 Jan 2015 18:32:25 +0000 (02:32 +0800)]
AsyncConnection: Fix deadlock if socket failed when replacing
If a client reconnects to an already marked-down endpoint, the server
side will detect that a remote reset happened, so it resets the existing
connection. Meanwhile, the client-side connection receives the retry tag
and tries to reconnect. Again, the client-side connection sends a
connect_msg with connect_seq(1), but this meets the server-side
connection's connect_seq(0), which makes the server side reply with the
reset tag. So the connection loops between the reset and retry tags.
One solution is to close the server-side connection if connect_seq == 0
and no message is in the queue. But that triggers another problem:
1. client tries to connect to an already marked-down endpoint
2. client calls send_message
3. server side accepts the new socket, replaces the old one and replies
with the retry tag
4. client increments connect_seq by one, but then a socket failure happens
5. the server-side connection detects this and closes, because its
connect_seq == 0 and it has no messages queued
6. client reconnects; the server side has no existing connection and meets
"connect.connect_seq > 0", so it replies with the RESET tag
7. client discards all messages in its queue, so we lose messages that
were never delivered
This solution adds a new "once_session_reset" flag to indicate whether
the "existing" connection has been reset. The server side's connect_seq
is 0 only when it has never connected successfully or has had a session
reset, and we only need to reply with the RESET tag if a session reset
has actually happened.
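A hedged sketch of that decision logic (illustrative names, not the actual AsyncConnection code):

```cpp
#include <deque>

// Hedged sketch of the server-side decision (illustrative names).
struct ExistingConn {
    unsigned connect_seq = 0;
    std::deque<int> out_q;           // pending messages
    bool once_session_reset = false; // set whenever the session resets
};

// connect_seq == 0 can mean "never connected" or "session was reset";
// only the latter should make us answer with the RESET tag.
bool should_reply_reset(const ExistingConn &existing) {
    return existing.once_session_reset;
}

bool should_close_quietly(const ExistingConn &existing) {
    // Safe to drop: nothing was ever established and nothing is queued.
    return existing.connect_seq == 0 &&
           existing.out_q.empty() &&
           !existing.once_session_reset;
}
```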
Haomai Wang [Wed, 14 Jan 2015 14:51:58 +0000 (22:51 +0800)]
AsyncConnection: Don't increment connect_seq if connect failed
If a connection sent many messages without being acked and was then
marked down, the next new connection will issue a connect_msg with
connect_seq=0. The server side needs to detect "connect_seq == 0 &&
existing->connect_seq > 0" so that it resets out_q and detects the
remote reset. But if the client side fails before sending the
connect_msg, it will then issue a connect_msg with a non-zero
connect_seq, which prevents the server side from detecting the remote
reset. The server side will reply with a non-zero in_seq and cause the
client to crash.
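A hedged sketch of the client-side rule (illustrative, not the actual AsyncConnection code):

```cpp
// Hedged sketch: bump connect_seq only after a successful handshake, so
// a failed attempt still presents connect_seq == 0 and the server can
// detect "connect_seq == 0 && existing->connect_seq > 0" as a reset.
struct ClientConn {
    unsigned connect_seq = 0;

    bool try_connect() {
        if (!do_handshake())
            return false;    // failure: leave connect_seq untouched
        ++connect_seq;       // only count attempts that went through
        return true;
    }

    bool do_handshake() { return true; /* network I/O elided */ }
};
```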
Haomai Wang [Wed, 14 Jan 2015 03:14:16 +0000 (11:14 +0800)]
AsyncConnection: Fix replacing causing loss of original state
Because AsyncConnection won't enter the "open" tag from the "replace"
tag, the code that sets reply_tag is never exercised when entering the
"open" tag. This causes the server side to discard out_q and lose state.
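A heavily hedged sketch of the shape of the fix (illustrative names; the real state machine is more involved):

```cpp
// Heavily hedged sketch: since "replace" no longer falls through to
// "open", the reply tag chosen while replacing must be acted on in the
// replace path itself, or the server forgets it and discards out_q.
enum class Tag { NONE, RETRY, READY };

struct ServerConn {
    Tag reply_tag = Tag::NONE;

    void handle_replace() {
        reply_tag = Tag::RETRY;  // decided here...
        send_reply(reply_tag);   // ...and sent here, not in "open"
    }

    void send_reply(Tag) { /* network I/O elided */ }
};
```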