librbd: move flush on new snap outside of snap_lock
snap_lock needs to be taken during writeback.
This is still protected by md_lock. The altered snapc doesn't
affect in-flight ops, so it's safe to update it before flushing.
Move into a separate class that requires layering to be enabled,
so the common step of creating and deleting a clone doesn't
need to be repeated in each test.
Move flatten tests into a subclass so they can be run separately
more easily.
Move the checks for the layering feature into a generic decorator
that skips tests if the specified feature is not being used.
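
A minimal sketch of such a decorator, assuming a nose/unittest-style runner that treats unittest.SkipTest as a skip. The names require_features, RBD_FEATURE_LAYERING and the config dict are hypothetical and are not taken from the actual test suite:

```python
import functools
import unittest

# Hypothetical feature bit and test configuration (assumptions for this
# sketch; the real tests may obtain the enabled features differently).
RBD_FEATURE_LAYERING = 1
config = {"features": 0}


def require_features(required):
    """Skip the decorated test unless every bit in `required` is enabled."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            if (config["features"] & required) != required:
                raise unittest.SkipTest("required features are not enabled")
            return test_func(*args, **kwargs)
        return wrapper
    return decorator


@require_features(RBD_FEATURE_LAYERING)
def test_clone():
    # Placeholder body; a real test would create and delete a clone here.
    return "ran"
```

The feature check is evaluated when the test runs rather than when it is defined, so the same test module can be run against images with different feature sets.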
Put the completion handling logic into new subclasses of
librbd::AioRequest, so the caching/non-caching paths can share
logic. These AioRequests replace AioBlockCompletion as representing
the I/O to a single object in an RBD image.
Write in terms of the asynchronous functions, so all the logic
is not duplicated. Now there's only a single point where each
operation needs to change for layering.
librbd: make ImageCtx methods take snap_id parameters
This makes it easier to use without racing with snap_set.
Requests in the cache, for example, store their snap_id
and may not be sent for a long time after being created.
Also rename the parameters for these methods so they
don't alias member variables.
Run on old and new style images, with different features. This is
intended to ease development, as opposed to being part of the qa
suite. It should be run from the src directory.
Extract a helper out of get_parent_info. The parent may become unset
while the child is open, so detect changes in it during ictx_refresh().
Don't watch the parent image, since we only care about the read-only
snapshot the child references, which cannot change.
librbd: allow an image to be opened without watching
Watching the header of a parent image could produce unreasonable
delays. If hundreds of child images watch the same parent, taking a
snapshot or resizing the parent would wait until all the children are
notified. Since the children are based on snapshots, they don't care
about any changes to the current version of the parent image, and
don't need to re-read the header on each change. Nothing that children
need in order to access their parent snapshot will change.
cls_rbd: make get_parent return valid data when layering is disabled
This means clients can treat an error in their multi-object
transaction as a failure for all of them. This makes the client side
much simpler since it can call get_parent on images that don't support
layering, or do support it but don't have parents, and not need to
check the return value of every operation in the transaction.
Samuel Just [Wed, 18 Jul 2012 18:31:09 +0000 (11:31 -0700)]
OSD: actually send queries during handle_pg_create
During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context. However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.
Part of #2771. Without the queries being sent,
can_create_pg will never become true.
Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]
objecter: always resend linger registrations
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request. The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch. This in turn will break the watch (i.e., notifies won't
get delivered).
Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.
* track the tid of the registration op for each LingerOp
* mark registration ops as should_resend=false; cancel as needed
* when we send a new registration op, cancel the old one to ensure we
ignore the reply. This is needed because we resend linger ops on any
pg change, not just a primary change.
* drop the first_send arg to send_linger(), as we can now infer that
from register_tid == 0.
The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.
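
The bookkeeping described in the list above can be modelled roughly as follows. This is a toy model under stated assumptions, not the Objecter code: ToyObjecter, in_flight and handle_reply are hypothetical names, and only register_tid and send_linger() come from the commit message.

```python
import itertools


class LingerOp:
    def __init__(self, linger_id):
        self.linger_id = linger_id
        self.register_tid = 0  # 0 means no registration has been sent yet


class ToyObjecter:
    """Toy model only; not the real Objecter interface."""

    def __init__(self):
        self._next_tid = itertools.count(1)
        self.in_flight = {}  # tid -> LingerOp awaiting a reply

    def send_linger(self, op):
        # first_send is inferred instead of being passed as an argument.
        first_send = op.register_tid == 0
        if not first_send:
            # Cancel the old registration so a late reply to it is ignored.
            self.in_flight.pop(op.register_tid, None)
        # A fresh tid stands in for a unique reqid, so the resend is not
        # treated as a dup of the earlier registration.
        op.register_tid = next(self._next_tid)
        self.in_flight[op.register_tid] = op
        return first_send

    def handle_reply(self, tid):
        """Return True only if the reply matches a live registration."""
        return self.in_flight.pop(tid, None) is not None
```

Replies carrying a cancelled tid are dropped, which matters because a resend can happen on any pg change, even when the original reply is still on its way.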
Fixes: #2796 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Samuel Just [Wed, 18 Jul 2012 16:26:11 +0000 (09:26 -0700)]
OSD: publish_map in init to initialize OSDService map
Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called. In particular, with handle_osd_ping,
not initializing the map member results in:
ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
2: (()+0xfcb0) [0x7fcd8243dcb0]
3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
8: (()+0x7e9a) [0x7fcd82435e9a]
9: (clone()+0x6d) [0x7fcd809ea4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Samuel Just [Tue, 17 Jul 2012 23:20:38 +0000 (16:20 -0700)]
OSD: handle_osd_ping: use service->get_osdmap()
This way, we avoid grabbing the map_lock. Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.
This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.
Sage Weil [Tue, 17 Jul 2012 19:38:40 +0000 (12:38 -0700)]
osd: default 'osd_preserve_trimmed_log = false'
This option makes the osd skip zeroing old trimmed regions of the log. The
data is never read, since the xattrs indicate which part of the log is
valid. We've never actually used this to debug a problem, and it consumes
space, so let's disable it.
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
The old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it yields a value of 3 for rbd3, but a value of 1 for
rbd3p1, so it can't be depended upon for rbd naming.
In the patch below the ceph-rbdnamer script is made more robust, and it
can now be called in various ways:
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
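
As an illustration of why the kernel name (%k) is a more reliable input than %n, here is a small sketch that recovers the rbd device number from names such as rbd3 or rbd3p1. It is not the ceph-rbdnamer script itself; the function name and the set of accepted input forms are assumptions:

```python
import re

# Accepts "rbd3", "rbd3p1", and the same names with a "/dev/" prefix.
# Which forms the real script accepts is not stated here; these are
# assumptions made for the sketch.
_RBD_KERNEL_NAME = re.compile(r"^(?:/dev/)?rbd(\d+)(?:p(\d+))?$")


def parse_rbd_kernel_name(name):
    """Return (device_number, partition_number or None)."""
    match = _RBD_KERNEL_NAME.match(name)
    if match is None:
        raise ValueError("not an rbd kernel name: {!r}".format(name))
    dev, part = match.groups()
    return int(dev), (int(part) if part is not None else None)
```

Both rbd3 and rbd3p1 resolve to device number 3, so a disk and its partitions map back to the same rbd image.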
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
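
The effect of treating disks and partitions differently can be sketched as below. This only models the resulting symlink names; the helper names are hypothetical and the actual work is done by udev rules, not by Python code:

```python
def link_for(pool, image, kernel_name):
    """Return the symlink path for a disk ("rbd3") or partition ("rbd3p1")."""
    # "rbd" itself contains no "p", so the first "p" (if any) separates
    # the disk name from the partition number.
    _disk, sep, part = kernel_name.partition("p")
    base = "/dev/rbd/{}/{}".format(pool, image)
    return "{}-part{}".format(base, part) if sep else base


def apply_udev_events(events):
    """Replay (pool, image, kernel_name) events into a link -> target map."""
    links = {}
    for pool, image, kernel_name in events:
        links[link_for(pool, image, kernel_name)] = "/dev/" + kernel_name
    return links
```

Because the partition gets its own link name, the later partition event no longer overwrites the link that points at the disk.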
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Tue, 10 Jul 2012 01:16:44 +0000 (18:16 -0700)]
mkcephfs: error out if mon data directory is not empty
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.
So, ensure that the directory is empty at mkfs time. This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
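
The check itself is simple. A sketch of the idea in Python (mkcephfs is a shell script, so this is an illustration of the logic rather than the actual change; ensure_empty_dir is a hypothetical name):

```python
import os


def ensure_empty_dir(path):
    """Raise instead of deleting anything if `path` already has entries."""
    entries = os.listdir(path)
    if entries:
        raise RuntimeError(
            "{} is not empty ({} entries); refusing to run mkfs".format(
                path, len(entries)))
```

Failing loudly leaves the decision to remove old data with the operator, which is the safer of the two options mentioned above.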
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
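
A rough sketch of the non-crash rule, assuming that lower numbers mean higher priority and that "below" includes equality. The function and parameter names are hypothetical, and the crash path is deliberately left out:

```python
def should_emit(msg_level, channel_threshold, log_level):
    """Non-crash case: emit to stderr/syslog only if both thresholds pass.

    msg_level         -- level of the message being considered
    channel_threshold -- stderr or syslog threshold
    log_level         -- normal log threshold for the message's subsystem
    """
    return msg_level <= channel_threshold and msg_level <= log_level
```

With only the channel threshold checked, a level 5 message would reach stderr whenever the stderr threshold is 10, even if the configured log level is 1.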
Samuel Just [Mon, 16 Jul 2012 20:07:56 +0000 (13:07 -0700)]
PG: use stats from primary after rewinding divergent entries
If the osd receiving the info has divergent entries, it will
also have a "divergent" stat structure.
Probably fixes #2769.
In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.
Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state, resulting in the
above assert failure.
Samuel Just [Fri, 22 Jun 2012 17:11:38 +0000 (10:11 -0700)]
PG: Place info in biginfo object
The purged_snaps set can grow without bound as snaps are
created and removed. Because the filestore doesn't
provide unlimited size collection attributes, it's better
to place the full info on the biginfo object, since we
need to write it during write_info anyway.
Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade.
Samuel Just [Fri, 29 Jun 2012 20:39:49 +0000 (13:39 -0700)]
PG: use write_info to set snap_collections in make_snap_collections
At one point, snap_collections were written to a pg collection
attribute. Subsequently, they were moved to the biginfo object
since the structure can grow too large for limited size xattrs.
make_snap_collection, however, was not updated.
Using write_info here should prevent this from happening in
the future.
Samuel Just [Fri, 13 Jul 2012 23:44:33 +0000 (16:44 -0700)]
OSD: set superblock compat_features on boot and mkfs
Previously, we did not actually persist the osd compatibility
mask. Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.
Samuel Just [Fri, 13 Jul 2012 21:23:27 +0000 (14:23 -0700)]
CompatSet: users pass bit indices rather than masks
CompatSet users number the Feature objects rather than
providing masks. Thus, we should do
mask |= (1 << f.id) rather than mask |= f.id.
In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.
This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.
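
The difference between the two expressions can be shown with plain integers. This is a sketch of the arithmetic only, not the CompatSet class; the function names are hypothetical:

```python
def buggy_mask(feature_ids):
    """OR the ids themselves together, as the broken code did."""
    mask = 0
    for fid in feature_ids:
        mask |= fid
    return mask


def fixed_mask(feature_ids):
    """Treat each id as a bit index, so every feature gets its own bit."""
    mask = 0
    for fid in feature_ids:
        mask |= 1 << fid
    return mask
```

With ids 1, 2 and 3, the buggy version yields 3 whether or not feature 3 is present, while the fixed version yields 6 for {1, 2} and 14 for {1, 2, 3}. Bit 0 is never used by the fixed version when ids start at 1.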
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: base misdirected op role calc on acting set
We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).
Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states
It is possible for a .new file to already exist, potentially with a
larger size. This would happen if:
- we were proposing a different value
- we crashed (or were stopped) before it got renamed into place
- after restarting, a different value was proposed and accepted.
This isn't so unlikely for the log state machine, where we're
aggregating random messages. O_TRUNC ensures we avoid getting the tail
end of some previous junk.
I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().
While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
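
The failure mode is easy to reproduce with plain POSIX file flags. A small sketch, assuming nothing about MonitorStore beyond what is described above (the file name and values are made up for the demonstration):

```python
import os
import tempfile


def write_value(path, data, truncate):
    """Write `data` at offset 0, optionally truncating the file first."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def demo():
    """Return file contents after a short write, without and with O_TRUNC."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "value.new")
        # A leftover, larger .new file from an earlier, abandoned proposal.
        write_value(path, b"old-long-value", truncate=True)
        write_value(path, b"new", truncate=False)
        with open(path, "rb") as f:
            without_trunc = f.read()
        write_value(path, b"old-long-value", truncate=True)
        write_value(path, b"new", truncate=True)
        with open(path, "rb") as f:
            with_trunc = f.read()
    return without_trunc, with_trunc
```

Without O_TRUNC the shorter value keeps the tail of the old one, which matches the larger-than-expected state value observed on one mon.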
Fixes: #2593 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 16 Jul 2012 05:03:31 +0000 (22:03 -0700)]
mon: remove osds from [near]full sets when their stats are removed from pgmap
Greg points out that we could have a situation like:
- mon recovers..
- goes through osdmaps, notes an osd was removed and removes from
full/nearfull
- goes through pgmaps, and re-adds it when it encounters some osd_stat_ts.
Fix this by removing the osd from the full/nearfull set when we remove
the osd_stat_t from the pgmap. Any osd removal is always followed by
an osd_stat_rm[] record when the primary processes the new osdmap and
proposes the appropriate pgmap updates.
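
A toy model of the fix, assuming a simplified pgmap that tracks one usage ratio per osd. ToyPGMap, its method names and the 0.85/0.95 thresholds are illustrative assumptions, not values or interfaces taken from the monitor code:

```python
class ToyPGMap:
    """Toy model only; thresholds and names are assumptions."""

    def __init__(self, nearfull_ratio=0.85, full_ratio=0.95):
        self.nearfull_ratio = nearfull_ratio
        self.full_ratio = full_ratio
        self.osd_stat = {}
        self.full_osds = set()
        self.nearfull_osds = set()

    def apply_osd_stat(self, osd, used_ratio):
        """Record a stat update and recompute the osd's set membership."""
        self.osd_stat[osd] = used_ratio
        self.full_osds.discard(osd)
        self.nearfull_osds.discard(osd)
        if used_ratio >= self.full_ratio:
            self.full_osds.add(osd)
        elif used_ratio >= self.nearfull_ratio:
            self.nearfull_osds.add(osd)

    def remove_osd_stat(self, osd):
        """Drop the stat and, as in the fix, the full/nearfull membership."""
        self.osd_stat.pop(osd, None)
        self.full_osds.discard(osd)
        self.nearfull_osds.discard(osd)
```

Tying the set cleanup to the stat removal means replaying older pgmap updates can re-add an osd only until its removal record is processed.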
Sage Weil [Sun, 15 Jul 2012 22:21:57 +0000 (15:21 -0700)]
filestore: dump open fds when we hit EMFILE
Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore.
Ideally, we should trigger this in other appropriate places, but it is
not immediately clear that there is a sane way to do that.
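
A sketch of what such a helper might look like, written in Python for brevity. It is Linux-specific because it relies on /proc/self/fd, and the function names are hypothetical rather than the filestore's own:

```python
import errno
import os


def list_open_fds(fd_dir="/proc/self/fd"):
    """Return {fd: target} for the current process (Linux only)."""
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd used to list the directory is closed by now.
            continue
    return result


def open_or_dump(path, flags):
    """Open `path`; on EMFILE, print the open fds before re-raising."""
    try:
        return os.open(path, flags)
    except OSError as e:
        if e.errno == errno.EMFILE:
            for fd, target in sorted(list_open_fds().items()):
                print("fd {} -> {}".format(fd, target))
        raise
```

Dumping at the point of failure shows which files or sockets are consuming the descriptor limit, which is usually the first question when EMFILE appears.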
Fixes: #2330 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 14 Jul 2012 21:32:28 +0000 (14:32 -0700)]
osdmap: drop useless and unused get_pg_role() method
Users probably want get_pg_acting_rank(). If they don't, they probably
have the mapping and can calculate the rank themselves. Having this here
is asking for bugs like #2022.
Sage Weil [Sat, 14 Jul 2012 21:29:29 +0000 (14:29 -0700)]
osd: simplify helper usage for misdirected ops
Make the helper exclusively for the PG != NULL cases, and open-code the
one PG == NULL caller. This is simpler, and lets us include more useful
information in the log message.