Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
Now the old udev rule passes %n to the ceph-rbdnamer script, the problem
with %n is that %n results in a value of 3 (for rbd3), but in a value of
1 (for rbd3p1), so it seems it can't be depended upon for rbdnaming.
In the patch below the ceph-rbdnamer script is made more robust and it
now it can be called in various ways:
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
Because we don't clear the scrub state before reseting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state resulting in the
above assert failure.
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states
It is possible for a .new file to already exist, potentially with a
larger size. This would happen if:
- we were proposing a different value
- we crashed (or were stopped) before it got renamed into place
- after restarting, a different value was proposed and accepted.
This isn't so unlikely for the log state machine, where we're
aggregating random messages. O_TRUNC ensure we avoid getting the tail
end of some previous junk.
I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().
While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
Fixes: #2593 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: based misdirected op role calc on acting set
We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdrected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>
rados tool: remove -t param option for target pool
Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.
Issue #2701. This info wasn't really used anywhere and we weren't
removing it. It was also sharing the same pool namespace as the
info indexed by bucket name, which is bad.
Samuel Just [Tue, 3 Jul 2012 18:23:16 +0000 (11:23 -0700)]
ReplicatedPG: remove faulty scrub assert in sub_op_modify_applied
This assert assumed that all ops submitted before MOSDRepScrub was
submitted were processed by the time that MOSDRepScrub was
processed. In fact, MOSDRepScrub's scrub_to may refer to a
last_update yet to be seen by the replica.
Yehuda Sadeh [Wed, 27 Jun 2012 00:16:11 +0000 (17:16 -0700)]
rest-bench: mark request as complete later
We marked a request as complete in the callback, however
it might be that we're still inside S3_runall_request_context()
which means that request is not really complete yet.
Possibly fixes bug #2652.
If we play 1-4 and then replay 1-4 again, we will end up removing
(b, 1)'s attributes since nlink for (a, 1) the second time through
is 1. We fix this by marking spos on the object_map header for
(a, 1) when we remove (a, 1) but not eh attributes.
- keyrings have new default locations that everyone should use.
- the user key setup is vastly simplified if you use the
'ceph auth get-or-create' command.
mon: MonmapMonitor: Use default port when the specified on 'add' is zero
Fixes a bug triggered by using the ceph tool to 'mon add' with a port set
to zero. We now default to the monitor's default port (6789) instead, and
we will fail if that port is already assigned to some other monitor.
Fixes: bug #2661 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Wed, 20 Jun 2012 18:07:29 +0000 (11:07 -0700)]
objecter: do not feed session to op_submit()
The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly. The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor. It
didn't affect the test results because we weren't testing whether the
watch was working in that case.
Instead, we'll just recalculate and get the same value the parent linger
op did. Which is fine, and goes through all the usual code paths so
nothing is missed.
Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 21 Jun 2012 14:31:47 +0000 (07:31 -0700)]
mon: encoding new monmap using quorum feature set
It is probably unlikely that someone will expand the mon cluster with a
mixed feature set, but we know the quorum features here, so we should use
them.
Sage Weil [Thu, 21 Jun 2012 03:41:17 +0000 (20:41 -0700)]
mon: conditionally encode auth incremental with quorum feature bits
If the quorum does not yet all have the MONENC feature, stick to the old
encoding.
It might be more polite to require a super-quorum before switching over,
and take note so that thereafter we can stick to the new encoding, but
that has more moving parts and I'm not sure it's worth the complexity.
Sage Weil [Thu, 21 Jun 2012 03:33:41 +0000 (20:33 -0700)]
mon: track intersection of quorum member features
When we form a quorum, also note the intersection of the quorum members'
feature bits. This will inform decisions about what encodings we use.
This is an imperfect strategy because the quorum may change, and we may
have a mon with old code join in and not understand what is going on.
However, it does ensure that a majority of the members run new code, so in
the absence of other failures we can make progress.
Samuel Just [Tue, 19 Jun 2012 16:11:57 +0000 (09:11 -0700)]
PG: improve find_best_info
07f853db3982e68b952a337cf91cbf7ec0709de9 is actually too conservative,
it suffices to find any info with a last_update of at least the least
last_update from the last period to go active. An info from a previous
interval is acceptable if the last interval never reported a commited
operation and thus still has the same last_update.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 18 Jun 2012 21:00:06 +0000 (14:00 -0700)]
mon: gracefully handle slow 'ceph -w' clients
If we are sending log updates to a client (ceph -w), and they are far
enough behind to drop behind first_committed, include a friendly message
in their stream but continue.
Drop useless return value from _create_sub_incremental(). Assert that we
can read the state file.
Samuel Just [Sat, 16 Jun 2012 00:09:42 +0000 (17:09 -0700)]
PG: best_info must have a last_epoch_started as high as any other info
We disregard incomplete infos during find_best_info, but we can't an
info with a last_epoch_started less that of the incomplete info.
This should avoid cases like #2462. In that case, it appears that
a peer with empty info/log was chosen as authoritative even though
there was a non-empty incomplete peer.