Josh Durgin [Sat, 9 Mar 2013 02:57:24 +0000 (18:57 -0800)]
librbd: invalidate cache when flattening
The cache stores which objects don't exist. Flatten bypasses the cache
when doing its copyups, so when it is done the -ENOENT from the cache
is treated as zeroes instead of 'need to read from parent'.
Clients that have the image open need to forget about the cached
non-existent objects as well. Do this during ictx_refresh, while the
parent_lock is held exclusively, so no new reads from the parent can
happen until the updated parent metadata is visible.
Josh Durgin [Sat, 9 Mar 2013 01:53:31 +0000 (17:53 -0800)]
ObjectCacher: add a method to clear -ENOENT caching
Clear the exists and complete flags for any objects that have exists
set to false, and force any in-flight reads to retry if they get
-ENOENT instead of generating zeros.
This is useful for getting the cache into a consistent state for rbd
after an image has been flattened, since many objects which previously
did not exist and went up to the parent to retrieve data may now exist
in the child.
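As a rough illustration of the flag reset described above, here is a minimal sketch. ToyObject, ToyCache and their members are hypothetical names invented for this example, not the ObjectCacher API, and the retry of in-flight reads is not modeled. It assumes that "clearing" means returning an object to a state where the cache no longer claims to know it is missing.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a cached object; only the two flags from the
// commit message are modeled.
struct ToyObject {
  bool exists = true;     // false once a read came back with -ENOENT
  bool complete = false;  // true when the cache believes its view is complete
};

// Hypothetical stand-in for the cache, keyed by object name.
struct ToyCache {
  std::map<std::string, ToyObject> objects;

  // Remember that the backing store reported this object as missing.
  void note_enoent(const std::string& oid) {
    ToyObject& o = objects[oid];
    o.exists = false;
    o.complete = true;
  }

  // True if a read could be answered from the cache alone as "no such object".
  bool cached_as_missing(const std::string& oid) const {
    std::map<std::string, ToyObject>::const_iterator it = objects.find(oid);
    return it != objects.end() && !it->second.exists && it->second.complete;
  }

  // Forget every cached "does not exist" answer, so the next read has to ask
  // the backing store again. Returns the number of objects that were reset.
  int clear_nonexistence() {
    int cleared = 0;
    for (auto& kv : objects) {
      ToyObject& o = kv.second;
      if (!o.exists) {
        o.exists = true;
        o.complete = false;
        ++cleared;
      }
    }
    return cleared;
  }
};
```

Objects that were never marked missing are left untouched, so only the -ENOENT knowledge is dropped, not the rest of the cache.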
Josh Durgin [Sat, 9 Mar 2013 01:49:27 +0000 (17:49 -0800)]
ObjectCacher: keep track of outstanding reads on an object
Reads always use C_ReadFinish as a callback (and they are the only
user of this callback). Keep an xlist of these for each object, so
they can remove themselves as they finish. To prevent races with other
requests and with discard removing objects from the cache, clear the
xlist in the object destructor, so if the Object is still valid the
set_item will still be on the list.
Make the ObjectCacher constructor take an Object* instead of the pool
and object id, which are derived from the Object* anyway.
On second thought, this will require a bit more care to ensure that all
of the paths radosgw needs to read/write from have the correct permissions
in the packages and so forth.
This increase only means that we'll keep more versions around before we
trim. It doesn't change the number of versions we'll keep around after
trimming (that's still as much as 'paxos_max_join_drift', i.e. 10), nor
does it change the criteria used to consider a monitor as having drifted
(same rule applies, 'paxos_max_join_drift').
This change however will enable the leader to put off trimming for a longer
period of time, giving a better chance for a monitor to join the cluster.
See, after going through the probing phase, at which point a monitor may
only be, say, 5 versions off, the same monitor may end up getting into the
quorum only to find that in-between probing and finally triggering an
election some 6 versions might have come to existence. Before this patch,
by then the state had been trimmed and the monitor would have to bootstrap
to perform a full store sync. With this patch in place, the monitor would
be able to sync the remaining 11 versions.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Danny Al-Gaaf [Fri, 1 Mar 2013 19:12:03 +0000 (20:12 +0100)]
mds/Locker.cc: fix warning about 'Possible null pointer dereference'
Fix following warning from cppcheck:
[src/mds/Locker.cc:2255] -> [src/mds/Locker.cc:2258]: (error)
Possible null pointer dereference: in - otherwise it is redundant
to check it against null.
Since head_in, which is passed to pick_inode_snap(), is already valid,
there is no need to check whether 'CInode *in' is NULL. Remove the
unneeded check.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Dan Mick [Fri, 8 Mar 2013 23:18:54 +0000 (15:18 -0800)]
ceph_common.sh: add warning if 'host' contains dots
This is a common error and there's no reason the script can't
at least tell you it's a really bad idea. One might argue it
could even successfully proactively truncate the host parameter
at the first dot, but that's a little controlling, perhaps.
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Fri, 8 Mar 2013 16:56:44 +0000 (08:56 -0800)]
osd: mark down connections from old peers
Close out any connection with an old peer. This avoids a race like:
- peer marked down
- we get map, mark down the con
- they reconnect and try to send us some stuff
- we share our map to tell them they are old and dead, but leave the con
open
...
- peer marks itself up a few times, eventually reuses the same port
- sends messages on their fresh con
- we discard because of our old con
This could cause a tight reconnect loop, but it is better than wrong
behavior.
Other possible fixes:
- make addr nonce truly unique (augment pid in nonce)
- make a smarter 'disposable' msgr state (bleh)
Jan Harkes [Thu, 7 Mar 2013 21:07:26 +0000 (16:07 -0500)]
Properly format Content-Length: header on 32-bit systems.
- Promote len argument in dump_content_length to uint64_t.
- Make sure there is sufficient scratch space to format string.
- Use PRIu64 macro for formatting.
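The three points above can be sketched as follows. The helper name format_content_length and the buffer size are assumptions made for this example; this is not the rgw function itself, only an illustration of formatting a 64-bit length portably.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a Content-Length header line for a 64-bit length.
std::string format_content_length(uint64_t len) {
  // "Content-Length: " (16 chars) + up to 20 digits + "\r\n" + NUL is 39 bytes,
  // so 64 bytes of scratch space is sufficient.
  char buf[64];
  // PRIu64 expands to the correct conversion for uint64_t on both 32-bit and
  // 64-bit platforms, unlike a hard-coded "%lu" or "%zu".
  int n = std::snprintf(buf, sizeof(buf), "Content-Length: %" PRIu64 "\r\n", len);
  if (n < 0 || static_cast<std::size_t>(n) >= sizeof(buf))
    return std::string();  // formatting failed or was truncated
  return std::string(buf, static_cast<std::size_t>(n));
}
```

With a 32-bit length type, an object larger than 4 GiB could not be represented at all, which is why the argument is widened before it is formatted.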
Yehuda Sadeh [Thu, 7 Mar 2013 03:32:21 +0000 (19:32 -0800)]
rgw: don't iterate through all objects when in namespace
Fixes: #4363
Backport: argonaut, bobtail
When listing objects in a namespace, don't iterate through all the
objects; only go through the ones that start with the namespace
prefix.
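The general idea of starting a listing at the prefix instead of scanning everything can be sketched with a sorted container. The function name list_with_prefix and the key names used in the test are made up for this example; the actual rgw listing code and its on-disk naming are not shown here.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: return the keys of a sorted index that start with
// 'prefix', without visiting keys that sort before the prefix.
std::vector<std::string> list_with_prefix(const std::set<std::string>& index,
                                          const std::string& prefix) {
  std::vector<std::string> out;
  // lower_bound() jumps to the first key >= prefix in O(log n).
  for (std::set<std::string>::const_iterator it = index.lower_bound(prefix);
       it != index.end(); ++it) {
    // Keys sharing a prefix are contiguous in sorted order, so stop at the
    // first key that no longer matches.
    if (it->compare(0, prefix.size(), prefix) != 0)
      break;
    out.push_back(*it);
  }
  return out;
}
```

The cost then depends on the number of matching keys rather than on the total number of objects in the bucket.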
Jan Harkes [Fri, 8 Mar 2013 03:10:42 +0000 (22:10 -0500)]
Avoid sending a success response when an error occurs.
Functions called from RGWGetObj_ObjStore_S3::send_response_data may
change the value of the non-local variable 'ret'. But the response
relies on a local 'req_state' which copies ret at the start of the
function.
Right now none of the called functions actually changes ret so the
problem doesn't trigger, but to avoid future breakage it is safer
to not rely on the (early) copy of the ret variable.
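The hazard can be shown with a small, self-contained sketch. All names here (ToyReqState, toy_helper, the two send_* functions) and the 200/500 status values are hypothetical and chosen only for illustration; they are not the rgw code.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-request state the response is built from.
struct ToyReqState {
  int err;
};

// Stand-in for a called function that changes ret as a side effect.
static void toy_helper(int* ret, int new_value) { *ret = new_value; }

// Fragile: the status is decided from a copy taken before the helper runs,
// so an error set by the helper is silently ignored.
int send_with_early_copy(int ret, int helper_result) {
  ToyReqState s = { ret };          // early copy of ret
  toy_helper(&ret, helper_result);  // ret changes after the copy
  return s.err == 0 ? 200 : 500;
}

// Safer: read ret at the point where the status is actually decided.
int send_with_late_read(int ret, int helper_result) {
  toy_helper(&ret, helper_result);
  ToyReqState s = { ret };
  return s.err == 0 ? 200 : 500;
}
```

With a helper that sets an error, the first variant still reports success while the second reports the failure.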
Sage Weil [Wed, 6 Mar 2013 03:12:21 +0000 (19:12 -0800)]
mds: pass created_ino back to client on replayed requests
After an MDS restart, the client will resend uncommitted requests. Use the
information we now have in the session_info_t to pass the created ino
back via the extra_bl payload in the reply.
Fixes: #4034
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 8 Mar 2013 00:33:12 +0000 (16:33 -0800)]
mds: track created inos with completed request ids in session info
Along with each session's completed request (tid), also track the created
ino (if any). This will be passed back to the client when it replays
requests, in the next patch.
Do not bother making this a backward compatible encoding. The only
benefit is to allow ceph-mds code to run and then old mds code to run
after it. Given all of the other incompat changes we *just* made, this
is highly unlikely and not worth the code clutter. If we had spanned
more releases or a stable release the story would be different.
While we are here, inline the second add_completed_request() method variant
since there is only a single caller and having it overloaded somewhat
obscures what is going on. This also avoids a duplicate lookup in the
session map in the have_session() check and then in the (old) helper.
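A minimal sketch of the bookkeeping described here, assuming a simple map from request tid to created ino. ToySessionInfo and its members are illustrative names, not the session_info_t definition, and using 0 to mean "no inode created" is an assumption of this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the per-session record of completed requests.
struct ToySessionInfo {
  // tid -> created ino; 0 means the request did not create an inode.
  std::map<uint64_t, uint64_t> completed_requests;

  void add_completed_request(uint64_t tid, uint64_t created_ino) {
    completed_requests[tid] = created_ino;
  }

  // Returns true if 'tid' already completed. When it did, *created_ino
  // receives the ino recorded for it, so a replayed request can be answered
  // with the same inode number.
  bool have_completed_request(uint64_t tid, uint64_t* created_ino) const {
    std::map<uint64_t, uint64_t>::const_iterator it = completed_requests.find(tid);
    if (it == completed_requests.end())
      return false;
    if (created_ino)
      *created_ino = it->second;
    return true;
  }
};
```

A single lookup answers both "was this request already completed?" and "which inode did it create?".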
Sage Weil [Sat, 2 Mar 2013 00:28:25 +0000 (16:28 -0800)]
mds: do not do traceless reply for open O_TRUNC requests
Even though these are "may_write" and may be a mutation, the MDS should not
fake out a traceless reply. In a real failure, it looks like:
- mds process request (mutation)
- mds fails
- client resends request
- mds notes that it is O_TRUNC, and already committed, and proceeds with
a regular open
And that is all fine.
But, that means that if we are trying to simulate this behavior without a
failure, then we should never do a traceless reply on an O_TRUNC request,
because the client will never see such a reply--it'll get a full open
request response instead.
Sage Weil [Fri, 1 Mar 2013 06:32:30 +0000 (22:32 -0800)]
client: clear I_COMPLETE on traceless reply for dentry request
If a request is against a dentry, and we get a traceless reply, clear
the directory I_COMPLETE flag on the parent directory because we can no
longer trust that our cache is complete.
It is possible we could do something a bit more intelligent here, but it
is not trivial because of racing requests, and traceless replies are
rare, so it's not worth the effort.
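The flag handling amounts to clearing one bit on the parent directory. In the sketch below, ToyInode, the TOY_I_* constants and note_dentry_reply are hypothetical names, not the client's actual types.

```cpp
#include <cassert>

// Hypothetical flag bits for a cached directory inode.
constexpr unsigned TOY_I_COMPLETE = 1u << 0;  // cache holds every dentry of the dir
constexpr unsigned TOY_I_OTHER    = 1u << 1;  // some unrelated flag

struct ToyInode {
  unsigned flags;
};

// After a dentry request, a reply without a trace tells us nothing about what
// changed in the directory, so stop trusting the cached listing.
void note_dentry_reply(ToyInode& parent_dir, bool reply_has_trace) {
  if (!reply_has_trace)
    parent_dir.flags &= ~TOY_I_COMPLETE;
}
```

Only the completeness bit is dropped; other flags on the directory are preserved.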
Sage Weil [Fri, 1 Mar 2013 02:07:33 +0000 (18:07 -0800)]
client: handle traceless replies from make_request()
Modify the generic make_request wrapper to retry lookups or getattrs on
requests when the ptarget pointer is specified but there is no trace in
the reply.
Refactor the _create() method to use this, effectively moving all of the
extra_bl code processing into make_request where it is generically
useful.
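The retry behaviour can be sketched as a wrapper that falls back to a follow-up lookup when the caller asked for a target but the reply carried no trace. ToyReply, toy_make_request and the callback signatures are assumptions made for this example and do not mirror the real client interface.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical reply: 'has_trace' is false for a traceless reply.
struct ToyReply {
  int result;      // 0 on success, negative errno otherwise
  bool has_trace;  // whether the reply describes the target inode
  uint64_t ino;    // valid only when has_trace is true
};

// Hypothetical wrapper: send the request, then make sure *ptarget is filled
// in, issuing a separate lookup when the reply was traceless.
int toy_make_request(const std::function<ToyReply()>& send,
                     const std::function<bool(uint64_t*)>& lookup,
                     uint64_t* ptarget) {
  ToyReply r = send();
  if (r.result != 0 || ptarget == nullptr)
    return r.result;  // failed, or the caller does not need the target
  if (r.has_trace) {
    *ptarget = r.ino;
    return 0;
  }
  // Traceless reply: the operation succeeded, but we must ask again to
  // learn about the target.
  return lookup(ptarget) ? 0 : -2;  // -2 stands in for -ENOENT
}
```

Putting the fallback in the wrapper means individual callers such as a create path do not each need their own handling for traceless replies.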