Tommi Virtanen [Thu, 12 Jul 2012 17:47:29 +0000 (10:47 -0700)]
upstart: Make ceph-osd always set the crush location.
This used to be conditional on config having osd_crush_location set,
but with that, minimal configuration left the OSD completely out of
the crush map, and prevented the OSD from starting properly.
Note: Ceph does not currently let this mechanism automatically move
hosts to another location in the CRUSH hierarchy. This means if you
let this run with defaults, setting osd_crush_location later will not
take effect. Set up your config file (or Chef environment) fully
before starting the OSDs the first time.
Yehuda Sadeh [Tue, 28 Aug 2012 23:17:21 +0000 (16:17 -0700)]
rgw: clear usage map before reading usage
Fixes: #3057
Since we read usage in chunks we need to clear the
usage map before reading the next chunk, otherwise
we're going to aggregate the old data as well.
Danny Kukawka [Thu, 16 Aug 2012 10:56:58 +0000 (12:56 +0200)]
fix keyring generation for mds and osd
[ The following text is in the "UTF-8" character set. ]
[ Your display is set for the "ANSI_X3.4-1968" character set. ]
[ Some characters may be displayed incorrectly. ]
Fix config keys for OSD/MDS data dirs. As in documentation and other
places of the scripts the keys are 'osd data'/'mds data' and not
'osd_data'
In case if MDS: if 'mds data' doesn't exist, create it.
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Danny Kukawka [Thu, 16 Aug 2012 10:56:32 +0000 (12:56 +0200)]
fix ceph osd create help
[ The following text is in the "UTF-8" character set. ]
[ Your display is set for the "ANSI_X3.4-1968" character set. ]
[ Some characters may be displayed incorrectly. ]
Change ceph osd create <osd-id> to ceph osd create <uuid>, since this
is what the command is really doing.
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Yehuda Sadeh [Wed, 1 Aug 2012 20:22:38 +0000 (13:22 -0700)]
rgw: fix usage trim call encoding
Fixes: #2841.
Usage trim operation was encoding the wrong op structure (usage read).
Since the structures somewhat overlapped it somewhat worked, but user
info wasn't encoded.
It was not encoding user, adding that and reset version
compatibility.
This changes affects command interface, makes use of
radosgw-admin usage trim incompatible. Use of old
radosgw-admin usage trim should be avoided, as it may
remove more data than requested. In any case, upgraded
server code will not handle old client's trim requests.
Yehuda Sadeh [Thu, 2 Aug 2012 18:13:05 +0000 (11:13 -0700)]
rgw: complete multipart upload can handle chunked encoding
Fixes: #2878
We now allow complete multipart upload to use chunked encoding
when sending request data. With chunked encoding the HTTP_LENGTH
header is not required.
Yehuda Sadeh [Wed, 1 Aug 2012 18:19:32 +0000 (11:19 -0700)]
rgw_xml: xml_handle_data() appends data string
Fixes: #2879.
xml_handle_data() appends data to the object instead of just
replacing it. Parsed data can arrive in pieces, specifically
when data is escaped.
Sage Weil [Tue, 31 Jul 2012 21:01:57 +0000 (14:01 -0700)]
osd: peering: detect when log source osd goes down
The Peering state has a generic check based on the prior set osds that
will restart peering if one of them goes down (or one of the interesting
down ones comes up). The GetLog state, however, can pull the log from
a peer that is not in the prior set if it got a notify from them (e.g., an
osd in an old interval that was down when the prior set was calculated).
If that osd goes down, we don't detect it and will block forward.
Fix by adding a simple check in GetLog for the newest_update_osd going
down.
(BTW GetMissing does not suffer from this problem because
peer_missing_requested is a subset of the prior set, so the Peering check
is sufficient.)
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Sat, 28 Jul 2012 17:05:47 +0000 (10:05 -0700)]
osd: set STRAY on pg load when non-primary
The STRAY bit indicates that we should annouce ourselves to the primary,
but it is only set in start_peering_interval(). We also need to set it
initially, so that a PG that is loaded but whose role does not change
(e.g., the stray replica stays a stray) will notify the primary.
Observed:
- osd starts up
- mapping does not change, STRAY not set
- does not announce to primary
- primary does not re-check must_have_unfound, objects appear unfound
Fix this by initializing STRAY when pg is loaded or created whenever we
are not the primary.
Fixes: #2866 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 27 Jul 2012 23:03:26 +0000 (16:03 -0700)]
osd: peering: make Incomplete a Peering substate
This allows us to still catch changes in the prior set that would affect
our conclusions (that we are incomplete) and, when they happen, restart
peering.
Consider:
- calc prior set, osd A is down
- query everyone else, no good info
- set down, go to Incomplete (previously WaitActingChange) state.
- osd A comes back up (we do nothing)
- osd A sends notify message with good info (we ignore)
By making this a Peering substate, we catch the Peering AdvMap reaction,
which will notice a prior set down osd is now up and move to Reset.
Fixes: #2860 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 27 Jul 2012 22:39:40 +0000 (15:39 -0700)]
osd: peering: move to Incomplete when.. incomplete
PG::choose_acting() may return false and *not* request an acting set change
if it can't find any suitable peers with enough info to recover. In that
case, we should move to Incomplete, not WaitActingChange, just like we do
a bit lower in GetLog() if we have non-contiguous logs. The state name is
more accurate, and this is also needed to fix bug #2860.
Sage Weil [Thu, 26 Jul 2012 23:35:00 +0000 (16:35 -0700)]
osd: fixing sharing of past_intervals on backfill restart
We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne(). However, the backfill code was updating the
peer info (history) in the block preceeding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.
Fix by checking dne() prior to the backfill block. We still need to fill
in the message later because it isn't yet instantiated.
Fixes: #2849 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Samuel Just [Mon, 9 Jul 2012 22:53:31 +0000 (15:53 -0700)]
ReplicatedPG: fix replay op ordering
After a client reconnect, the client replays outstanding ops. The
OSD then immediately responds with success if the op has already
committed (version < ReplicatedPG::get_first_in_progress).
Otherwise, we stick it in waiting_for_ondisk to be replied to when
eval_repop concludes that waitfor_disk is empty.
Fixes #2508
Signed-off-by: Samuel Just <sam.just@inktank.com>
Conflicts:
Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]
objecter: always resend linger registrations
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request. The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch. This in turn will break the watch (i.e., notifies won't
get delivered).
Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registeration for each session.
* track the tid of the registation op for each LingerOp
* mark registrations ops as should_resend=false; cancel as needed
* when we send a new registration op, cancel the old one to ensure we
ignore the reply. This is needed becuase we resend linger ops on any
pg change, not just a primary change.
* drop the first_send arg to send_linger(), as we can now infer that
from register_tid == 0.
The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.
Fixes: #2796 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Fri, 6 Jul 2012 01:08:58 +0000 (18:08 -0700)]
librados: take lock when signaling notify cond
When we are signaling the cond to indicate that a notify is complete,
take the appropriate lock. This removes the possibility of a race
that loses our signal. (That would be very difficult given that there
are network round trips involved, but this makes the lock/cond usage
"correct.")
Sage Weil [Thu, 26 Jul 2012 22:01:05 +0000 (15:01 -0700)]
filestore: check for EIO in read path
Check for EIO in read methods and helpers. Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.
The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.
By default we will assert/fail/crash on EIO from the underlying fs. We
already do this in the write path, but not the read path, or in various
internal infrastructure.
Sage Weil [Wed, 25 Jul 2012 21:53:34 +0000 (14:53 -0700)]
osd: only commit past intervals at end of parallel build
We don't check for gaps in the past intervals, so we should only commit
this when we are completely done. Otherwise a partial run and rsetart will
leave the gap in place, which may confuse the peering code that relies on
this information.
Sage Weil [Wed, 25 Jul 2012 17:57:35 +0000 (10:57 -0700)]
osd: generate past intervals in parallel on boot
Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while. This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.
On bootup, we look at the intervals each pg needs and calclate the union,
and then iterate over that map range. The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.
librbd: replace assign_bid with client id and random number
The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.
This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.
Keep the server side assign_bid around in case there are old clients
still using it.
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
Now the old udev rule passes %n to the ceph-rbdnamer script, the problem
with %n is that %n results in a value of 3 (for rbd3), but in a value of
1 (for rbd3p1), so it seems it can't be depended upon for rbdnaming.
In the patch below the ceph-rbdnamer script is made more robust and it
now it can be called in various ways:
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
Because we don't clear the scrub state before reseting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state resulting in the
above assert failure.
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states
It is possible for a .new file to already exist, potentially with a
larger size. This would happen if:
- we were proposing a different value
- we crashed (or were stopped) before it got renamed into place
- after restarting, a different value was proposed and accepted.
This isn't so unlikely for the log state machine, where we're
aggregating random messages. O_TRUNC ensure we avoid getting the tail
end of some previous junk.
I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().
While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
Fixes: #2593 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: based misdirected op role calc on acting set
We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdrected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>
rados tool: remove -t param option for target pool
Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.
Sage Weil [Tue, 10 Jul 2012 20:18:27 +0000 (13:18 -0700)]
msgr: take over existing Connection on Pipe replacement
If a new pipe/socket is taking over an existing session, it should also
take over the Connection* associated with the existing session. Because
we cannot clear existing->connection_state, we just take another reference.
Clean up the comments a bit while we're here.
This affects MDS<->client sessions when reconnecting after a socket fault.
It probably also affects intra-cluster (osd/osd, mds/mds, mon/mon)
sessions as well, but I did not confirm that.
Backport: argonaut Signed-off-by: Sage Weil <sage@inktank.com>
Issue #2701. This info wasn't really used anywhere and we weren't
removing it. It was also sharing the same pool namespace as the
info indexed by bucket name, which is bad.
Samuel Just [Tue, 3 Jul 2012 18:23:16 +0000 (11:23 -0700)]
ReplicatedPG: remove faulty scrub assert in sub_op_modify_applied
This assert assumed that all ops submitted before MOSDRepScrub was
submitted were processed by the time that MOSDRepScrub was
processed. In fact, MOSDRepScrub's scrub_to may refer to a
last_update yet to be seen by the replica.
Sage Weil [Sun, 1 Jul 2012 22:37:31 +0000 (15:37 -0700)]
msgr: choose incoming connection if ours is STANDBY
If the connect_seq matches, but our existing connection is in STANDBY, take
the incoming one. Otherwise, the other end will wait indefinitely for us
to connect but we won't.
Alternatively, we could "win" the race and trigger a connection by sending
a keepalive (or similar), but that is more work; we may as well accept the
incoming connection we have now.
This removes STANDBY from the acceptable WAIT case states. It also keeps
responsibility squarely on the shoulders of the peer with something to
deliver.
Without this patch, a 3-osd vstart cluster with
'ms inject socket failures = 100' and rados bench write -b 4096 would start
generating slow request warnings after a few minutes due to the osds
failing to connect to each other. With the patch, I complete a 10 minute
run without problems.