Sage Weil [Thu, 26 Jul 2012 23:35:00 +0000 (16:35 -0700)]
osd: fixing sharing of past_intervals on backfill restart
We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne(). However, the backfill code was updating the
peer info (history) in the block preceding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.
Fix by checking dne() prior to the backfill block. We still need to fill
in the message later because it isn't yet instantiated.
Fixes: #2849 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Sage Weil [Fri, 27 Jul 2012 04:55:00 +0000 (21:55 -0700)]
filestore: check for EIO in read path
Check for EIO in read methods and helpers. Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.
The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
By default we will assert/fail/crash on EIO from the underlying fs. We
already do this in the write path, but not the read path, or in various
internal infrastructure.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 25 Jul 2012 17:57:35 +0000 (10:57 -0700)]
osd: generate past intervals in parallel on boot
Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while. This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.
On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range. The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.
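A rough Python sketch of the boot-time pass described above, assuming each PG reports the inclusive epoch range it still needs. The names `needed`, `on_epoch`, and the per-PG state dict are hypothetical stand-ins, not the OSD's actual structures:

```python
def generate_in_parallel(needed, on_epoch):
    """Walk the union of epoch ranges once, updating every PG that needs each epoch.

    needed:   {pgid: (first_epoch, last_epoch)}, inclusive ranges (hypothetical input).
    on_epoch: callback(pgid, epoch, state) standing in for the per-PG interval logic.
    """
    if not needed:
        return {}
    # Union of all ranges, taken here as one span from the lowest to the highest epoch.
    start = min(first for first, _ in needed.values())
    end = max(last for _, last in needed.values())
    # One state dict per PG, playing the role of the pistate struct.
    state = {pgid: {} for pgid in needed}
    for epoch in range(start, end + 1):  # each map epoch is visited exactly once
        for pgid, (first, last) in needed.items():
            if first <= epoch <= last:
                on_epoch(pgid, epoch, state[pgid])
    return state
```

The point of the single outer loop is that each map is loaded once for all PGs, rather than once per PG.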
Sage Weil [Tue, 24 Jul 2012 21:53:06 +0000 (14:53 -0700)]
admin_socket: json output, always
If the perfcounters stuff were refactored to use the Formatter, we could
put the JSONFormatter in the admin_socket code and make this a bit less
annoying. Later.
Sage Weil [Tue, 24 Jul 2012 18:02:37 +0000 (11:02 -0700)]
osd: fix pg log zeroing
Zero the right number of bytes. Fixes a bug where we clobber legit log
data. Fortunately this is only triggered with osd preserve pg log = false,
which was not the default until recently in master.
Fixes: #2799 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
Pierre Rognant [Wed, 25 Apr 2012 14:23:50 +0000 (16:23 +0200)]
Wireshark dissector updated to work with the current development tree of wireshark. The way I patched it is not really clean, but it can be useful for people who quickly need to inspect ceph network flows.
librbd: replace assign_bid with client id and random number
The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.
This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.
Keep the server side assign_bid around in case there are old clients
still using it.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 23 Jul 2012 23:51:03 +0000 (16:51 -0700)]
osd: fix ACK ordering on resent ops
The wait_for_ondisk handling fixed COMMIT ordering, but the ACKs need to
go back in the same order too. For example:
- op A is queued
- client disconnects, both ACK and COMMIT replies are lost
- client reconnects
- op A and B are sent
- op A is queued
- op B is applied, ACK is sent
- op A and B COMMITs are sent
-> client's ack callbacks will see B and then A.
Fix this by creating a waiting_for_ack queue as well, and sending ACK
responses as needed. Also handle the case where the ACK should be sent
immediately when the retry event is received.
Fixes: #2823 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
Sage Weil [Sat, 21 Jul 2012 16:15:06 +0000 (09:15 -0700)]
crush: fix name map encoding
We screwed up and encoded the name map using the plain 'int' type instead of int32_t.
That means people have systems encoding this as both 32 and 64 bit,
depending on their architecture. This could be worse: x86_64 still has a
32-bit int (at least in my environment).
In any case, mixing both word sizes in their clusters is broken as a
result, with the exception of the kernel code, which doesn't decode this
part of the map and will tolerate differently-sized servers.
Fix this by:
* encoding using int32_t now
* decoding either 32-bit or 64-bit values, by assuming that the strings
will always be non-empty. This appears to be the case.
However:
* any cluster with 64-bit ints must upgrade all at once, or else the new
code will start encoding 32-bit values and the old code will be
confused.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
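A minimal Python sketch of how a decoder could accept both key widths, relying on the non-empty-string assumption stated above. The wire layout used here (little-endian count, then key plus length-prefixed string) and the function names are assumptions for illustration only. It handles only non-negative keys: a negative 64-bit key has a non-zero upper word and would not be detected this way.

```python
import struct

def encode_name_map(names, key_fmt="<i"):
    """Encode {int: str} as a count followed by (key, length, bytes) entries.

    key_fmt "<i" models the fixed 32-bit encoding, "<q" the old 64-bit one.
    """
    out = struct.pack("<I", len(names))
    for key, name in sorted(names.items()):
        data = name.encode("utf-8")
        out += struct.pack(key_fmt, key) + struct.pack("<I", len(data)) + data
    return out

def decode_name_map(buf):
    """Decode entries whose keys were written as either 32-bit or 64-bit ints."""
    (count,) = struct.unpack_from("<I", buf, 0)
    off = 4
    names = {}
    for _ in range(count):
        (key,) = struct.unpack_from("<i", buf, off)
        off += 4
        (length,) = struct.unpack_from("<I", buf, off)
        off += 4
        if length == 0:
            # Strings are assumed non-empty, so a zero here must be the upper
            # half of a 64-bit key; the real length follows.
            (length,) = struct.unpack_from("<I", buf, off)
            off += 4
        names[key] = buf[off:off + length].decode("utf-8")
        off += length
    return names
```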
Samuel Just [Fri, 20 Jul 2012 00:43:17 +0000 (17:43 -0700)]
OpRequest,OSD: track recent slow ops
This should be helpful while investigating slow performance.
OpRequests now track events with timestamps in addition
to dumping them to the log. OpHistory keeps up to a
configurable number of the slowest ops over a configurable
recent time interval. The admin socket interface for the OSD
now has a dump_historic_ops command which dumps the stored
slow ops.
Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Samuel Just <sam.just@inktank.com>
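As an illustration of the retention policy described above, here is a small Python sketch that keeps at most a configurable number of the slowest ops within a recent time window. The class layout, parameter names, and dump format are assumptions, not the OSD's actual OpHistory code:

```python
import time

class OpHistory:
    """Keep up to max_ops of the slowest ops that finished within max_age seconds."""

    def __init__(self, max_ops=10, max_age=600.0):
        self.max_ops = max_ops
        self.max_age = max_age
        self.ops = []  # list of (finished_at, duration, description)

    def insert(self, description, duration, now=None):
        now = time.time() if now is None else now
        self.ops.append((now, duration, description))
        self._trim(now)

    def _trim(self, now):
        # Drop ops that fell out of the recent window.
        self.ops = [op for op in self.ops if now - op[0] <= self.max_age]
        # Keep only the slowest max_ops of what remains.
        self.ops.sort(key=lambda op: op[1], reverse=True)
        del self.ops[self.max_ops:]

    def dump(self, now=None):
        """Return the stored ops, slowest first, similar in spirit to dump_historic_ops."""
        now = time.time() if now is None else now
        self._trim(now)
        return [{"description": d, "duration": dur} for _, dur, d in self.ops]
```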
Providing an objclass to create and manipulate advisory
locking. Also providing a client api to control it. A lock
may either be exclusively locked or shared among multiple
lockers. A locker is identified by the rados client name, and
by a cookie-string.
A lock may be assigned with a tag that every operation on that
lock should use. A lock can be unlocked by the client that locked
it, or may be broken by other clients.
When a non-zero lock duration is assigned to a lock by a locker,
that locker expires after that time duration.
A lock may have a description.
Locks on a specific object can be listed. Lockers of a specific
lock can be enumerated (by get_info).
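The semantics above can be modelled with a toy in-memory class. This is only a sketch: the method names, return values, and conflict rules are assumptions chosen to match the description, not the objclass or client API itself.

```python
class AdvisoryLock:
    """Toy model of one advisory lock: exclusive or shared, with optional expiry."""

    def __init__(self, description=""):
        self.description = description
        self.exclusive = False
        self.tag = None
        self.lockers = {}  # (client_name, cookie) -> expiry time, or None for no expiry

    def _expire(self, now):
        # Lockers with a non-zero duration disappear once their time is up.
        self.lockers = {k: exp for k, exp in self.lockers.items()
                        if exp is None or exp > now}

    def lock(self, client, cookie, exclusive, tag="", duration=0, now=0.0):
        self._expire(now)
        if self.lockers:
            # Existing holders: refuse exclusive requests, exclusive holders,
            # and requests that do not use the lock's tag.
            if exclusive or self.exclusive or tag != self.tag:
                return False
        self.exclusive = exclusive
        self.tag = tag
        self.lockers[(client, cookie)] = (now + duration) if duration else None
        return True

    def unlock(self, client, cookie):
        """Release the lock held by this (client, cookie) locker."""
        if (client, cookie) in self.lockers:
            del self.lockers[(client, cookie)]
            return True
        return False

    def break_lock(self, client, cookie):
        """Remove another client's locker; same effect as unlock in this toy model."""
        return self.unlock(client, cookie)

    def get_info(self, now=0.0):
        """Enumerate the current lockers."""
        self._expire(now)
        return sorted(self.lockers)
```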
Samuel Just [Fri, 20 Jul 2012 19:00:42 +0000 (12:00 -0700)]
os/HashIndex: use set<pair<string, hobject_t>> rather than multimap
Multimap does not make any guarantees about ordering of different
values with the same key. list_by_hash, however, assumes that
the iterator order matches hobject_t order. Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.
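The ordering difference can be shown with a small Python analogy. Plain strings stand in for hobject_t, and a stable sort on the key alone stands in for a container that only orders by key; both are assumptions made for illustration:

```python
def list_by_key_only(entries):
    # Models ordering by key alone: entries with equal keys keep whatever
    # order they were inserted in (Python's sort is stable).
    return sorted(entries, key=lambda e: e[0])

def list_by_pair(entries):
    # Models set<pair<string, hobject_t>>: unique entries ordered by
    # (key, object), so equal keys are further ordered by the object.
    return sorted(set(entries))
```

With entries `("A1", "obj_b")` and `("A1", "obj_a")` inserted in that order, only the pair ordering lists `obj_a` before `obj_b`.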
Sage Weil [Fri, 20 Jul 2012 04:27:37 +0000 (21:27 -0700)]
osd: defer boot if heartbeatmap indicates we are unhealthy
If the OSD is bogged down or unresponsive, we should not try to join
the cluster. This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).
Fixes: #2502 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Sage Weil [Thu, 19 Jul 2012 23:47:23 +0000 (16:47 -0700)]
osd/mon: subscribe (onetime) to pg creations on connect
Ask the monitor for pending pg creations each time we connect.
Normally, this is a freebie check. If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it. Specifically:
- osd is hunting for a monitor, but isn't yet connected
- new pgs are created
- send_pg_creates() sends out create messages, but osd does not get it
- osd finally connects to a mon
Fixes: #2151 (tho the bug description is bad) Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 18 Jul 2012 18:31:09 +0000 (11:31 -0700)]
OSD: actually send queries during handle_pg_create
During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context. However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.
Part of #2771. Without the queries being sent,
can_create_pg will never become true.
Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]
objecter: always resend linger registrations
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request. The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch. This in turn will break the watch (i.e., notifies won't
get delivered).
Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.
* track the tid of the registration op for each LingerOp
* mark registration ops as should_resend=false; cancel as needed
* when we send a new registration op, cancel the old one to ensure we
ignore the reply. This is needed because we resend linger ops on any
pg change, not just a primary change.
* drop the first_send arg to send_linger(), as we can now infer that
from register_tid == 0.
The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.
Fixes: #2796 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Samuel Just [Wed, 18 Jul 2012 16:26:11 +0000 (09:26 -0700)]
OSD: publish_map in init to initialize OSDService map
Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called. In particular, with handle_osd_ping,
not initializing the map member results in:
ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
2: (()+0xfcb0) [0x7fcd8243dcb0]
3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
8: (()+0x7e9a) [0x7fcd82435e9a]
9: (clone()+0x6d) [0x7fcd809ea4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Samuel Just [Tue, 17 Jul 2012 23:20:38 +0000 (16:20 -0700)]
OSD: handle_osd_ping: use service->get_osdmap()
This way, we avoid grabbing the map_lock. Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.
This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.
Sage Weil [Tue, 17 Jul 2012 19:38:40 +0000 (12:38 -0700)]
osd: default 'osd_preserve_trimmed_log = false'
This option makes the osd skip zeroing old trimmed regions of the log. The
data is never read, since the xattrs indicate which part of the log is
valid. We've never actually used this to debug a problem, and it consumes
space, so let's disable it.
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
The old udev rule passes %n to the ceph-rbdnamer script. The problem
is that %n yields 3 for rbd3 but 1 for rbd3p1, so it cannot be relied
upon for naming rbd devices.
In the patch below the ceph-rbdnamer script is made more robust and
can now be called in various ways:
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
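A short Python sketch of the disk/partition distinction described in this message. The real ceph-rbdnamer is a shell script driven by udev; the regular expression, the accepted "/dev/" prefix, and both function names below are illustrative assumptions:

```python
import re

# Matches a kernel name such as rbd3 or rbd3p1, optionally prefixed with /dev/.
_RBD_NAME = re.compile(r"^(?:/dev/)?rbd(\d+)(?:p(\d+))?$")

def parse_rbd_kernel_name(name):
    """Return (device_number, partition_number or None) for an rbd kernel name."""
    m = _RBD_NAME.match(name)
    if m is None:
        raise ValueError("not an rbd device name: %r" % name)
    dev = int(m.group(1))
    part = int(m.group(2)) if m.group(2) else None
    return dev, part

def symlink_name(image, part):
    """Disks keep the image name; partitions get a -partN suffix."""
    return image if part is None else "{}-part{}".format(image, part)
```

Parsing the full kernel name (%k) always recovers the device number 3 for both rbd3 and rbd3p1, which %n alone does not.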
Sage Weil [Tue, 10 Jul 2012 01:16:44 +0000 (18:16 -0700)]
mkcephfs: error out if mon data directory is not empty
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.
So, ensure that the directory is empty at mkfs time. This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
Samuel Just [Mon, 16 Jul 2012 20:07:56 +0000 (13:07 -0700)]
PG: use stats from primary after rewinding divergent entries
If the osd receiving the info has divergent entries, it will
also have a "divergent" stat structure.
Probably fixes #2769.
In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.
Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state resulting in the
above assert failure.
Samuel Just [Fri, 22 Jun 2012 17:11:38 +0000 (10:11 -0700)]
PG: Place info in biginfo object
The purged_snaps set can grow without bound as snaps are
created and removed. Because the filestore doesn't
provide unlimited size collection attributes, it's better
to place the full info on the biginfo object, since we
need to write it during write_info anyway.
Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade.
Samuel Just [Fri, 29 Jun 2012 20:39:49 +0000 (13:39 -0700)]
PG: use write_info to set snap_collections in make_snap_collections
At one point, snap_collections were written to a pg collection
attribute. Subsequently, they were moved to the biginfo object
since the structure can grow too large for limited size xattrs.
make_snap_collection, however, was not updated.
Using write_info here should prevent this from happening in
the future.
Samuel Just [Fri, 13 Jul 2012 23:44:33 +0000 (16:44 -0700)]
OSD: set superblock compat_features on boot and mkfs
Previously, we did not actually persist the osd compatibility
mask. Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.
Samuel Just [Fri, 13 Jul 2012 21:23:27 +0000 (14:23 -0700)]
CompatSet: users pass bit indices rather than masks
CompatSet users number the Feature objects rather than
providing masks. Thus, we should do
mask |= (1 << f.id) rather than mask |= f.id.
In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.
This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.
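The arithmetic behind this can be checked with a few lines of Python. The function names are hypothetical; only the two mask expressions come from the message above:

```python
def mask_from_ids_broken(ids):
    """Old behaviour: OR the feature index itself into the mask."""
    mask = 0
    for fid in ids:
        mask |= fid  # treats the index as if it were already a bit mask
    return mask

def mask_from_ids(ids):
    """Fixed behaviour: set the bit at position fid."""
    mask = 0
    for fid in ids:
        mask |= 1 << fid
    return mask
```

With the broken form, feature sets {1, 2} and {1, 2, 3} both produce the mask 3, so feature 3 goes undetected; the fixed form gives distinct masks.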
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: base misdirected op role calc on acting set
We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>