David Zafman [Mon, 29 Apr 2013 21:36:18 +0000 (14:36 -0700)]
osd: Rename members and methods related to stat publish
pg_stats_lock to pg_stats_publish_lock
pg_stats_valid to pg_stats_publish_valid
pg_stats_stable to pg_stats_publish
update_stats() to publish_stats_to_osd()
clear_stats() to clear_publish_stats()
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Mon, 29 Apr 2013 18:06:36 +0000 (11:06 -0700)]
mon: remap creating pgs on startup
After Monitor::init_paxos() has loaded all of the PaxosService state,
we should then map creating pgs to osds. This ensures we do so after the
osdmap has been loaded and the pgs actually map somewhere meaningful.
Fixes: #4675
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 29 Apr 2013 18:11:24 +0000 (11:11 -0700)]
mon: only map/send pg creations if osdmap is defined
This avoids calculating new pg creation mappings if the osdmap isn't
loaded yet, which currently happens during Monitor::paxos_init()
on startup. Assuming osdmap epoch is nonzero, it should always be
safe to do this (although possibly unnecessary).
More cleanup here is certainly possible, but this is one step toward fixing
the bad behavior for #4675.
Sage Weil [Mon, 29 Apr 2013 17:45:31 +0000 (10:45 -0700)]
client: make dup reply a louder error
If we get a dup reply something is probably wrong! We should make sure
it appears more loudly in the log. In particular, it can lead to out
of sync cap state; see #4853.
Sage Weil [Mon, 29 Apr 2013 17:44:28 +0000 (10:44 -0700)]
client: fix session open vs mdsmap race with request kicking
A sequence like:
- ceph-fuse starts, make_request on getattr
- waits for mds to be active
- tries to open a session
- mds restarts, recovers
- eventually gets session open reply
- sends first getattr (even though mds is in reconnect state)
- gets mdsmap update that mds is now active
- kicks request, resends getattr
- get first reply
- ignore second reply, caps get out of sync
The bug is that we send the first request when the MDS is still in
the reconnect state. The fix is to loop in make_request so that we
ensure all conditions are satisfied before sending the request. Any
time we wait, we loop, so that we know all conditions (still) pass if
we make it to the end.
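
The loop described above can be sketched roughly as follows. This is a toy
Python model, not the Ceph client code: FakeMds, make_request, the field
names, and the scripted events are all hypothetical, and each "wait" is
simulated by applying the next scripted update.

```python
class FakeMds:
    """Hypothetical stand-in for the client's view of one MDS."""

    def __init__(self, events):
        self.state = "reconnect"     # MDS has restarted and is recovering
        self.session_open = False
        self._events = list(events)  # scripted (field, value) updates

    def wait(self):
        # Simulates blocking until the next mdsmap update or session reply.
        field, value = self._events.pop(0)
        setattr(self, field, value)


def make_request(mds, send):
    # After every wait, start over and re-check all conditions, so that
    # reaching the send means every condition held in the same pass.
    while True:
        if mds.state != "active":
            mds.wait()
            continue
        if not mds.session_open:
            mds.wait()
            continue
        send(mds.state)
        return


sent = []
# The session open reply arrives while the MDS is still in reconnect;
# only afterwards does the mdsmap report the MDS as active.
mds = FakeMds([("session_open", True), ("state", "active")])
make_request(mds, sent.append)
```

In this model the request is sent exactly once, and only after the MDS is
seen as active, even though the session opened earlier.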
Fixes: #4853
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sun, 28 Apr 2013 00:59:24 +0000 (17:59 -0700)]
ceph-filestore-dump: fix warnings on i386 build
tools/ceph-filestore-dump.cc: In member function ‘int header::get_header()’:
warning: tools/ceph-filestore-dump.cc:454:19: comparison between signed and unsigned integer expressions [-Wsign-compare]
tools/ceph-filestore-dump.cc: In member function ‘int footer::get_footer()’:
warning: tools/ceph-filestore-dump.cc:471:19: comparison between signed and unsigned integer expressions [-Wsign-compare]
tools/ceph-filestore-dump.cc: In member function ‘int super_header::read_super()’:
warning: tools/ceph-filestore-dump.cc:697:30: comparison between signed and unsigned integer expressions [-Wsign-compare]
Gary Lowell [Fri, 26 Apr 2013 08:53:08 +0000 (01:53 -0700)]
debian/rules: Fix tcmalloc breakage
Since all currently supported platforms have tcmalloc
available and it is now the default, remove broken check code
that turns it off if the package is not listed in build-depends.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Fri, 26 Apr 2013 19:22:28 +0000 (12:22 -0700)]
mon: cache osd epochs
The monitor may get a series of messages from the OSD that prompt it to
send incremental maps (pg_temp updates, failures, probably more). Avoid
sending the same incremental maps twice by keeping a cache of what epochs
we think the OSDs have.
This reduces monitor load, especially when the mon is a bit behind and is
getting a stream of delayed messages, and the work associated with sending
the inc maps prevents it from catching up.
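
A minimal sketch of the idea, assuming a simple per-OSD dictionary. The
class and method names here (OsdEpochCache, epochs_to_send) are hypothetical
and do not come from the monitor code; the sketch only illustrates how
remembering what was already sent suppresses duplicate work.

```python
class OsdEpochCache:
    """Hypothetical cache of the newest osdmap epoch each OSD should have."""

    def __init__(self):
        self._epochs = {}  # osd id -> newest epoch we believe it has

    def epochs_to_send(self, osd, reported_epoch, current_epoch):
        # Start from whichever is newer: the epoch the OSD reported in
        # this (possibly delayed) message, or what we already sent it.
        start = max(reported_epoch, self._epochs.get(osd, 0))
        if start >= current_epoch:
            return []  # nothing new to send
        self._epochs[osd] = current_epoch
        return list(range(start + 1, current_epoch + 1))


cache = OsdEpochCache()
first = cache.epochs_to_send(osd=3, reported_epoch=10, current_epoch=12)
# A delayed message from the same OSD still claims epoch 10.
second = cache.epochs_to_send(osd=3, reported_epoch=10, current_epoch=12)
# The map advances; only the one new epoch is sent.
third = cache.epochs_to_send(osd=3, reported_epoch=10, current_epoch=13)
```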
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
There was an issue when limit was being set, we didn't
break from the iterating loop if limit was reached. Also,
S3 does not enforce any limit, so keep that behavior.
Sage Weil [Fri, 26 Apr 2013 17:32:38 +0000 (10:32 -0700)]
mon: mark PaxosServiceMessage forward fields deprecated
These are no longer used; we manage forward state explicitly via the
Monitor sessions instead. Mark them deprecated so we don't accidentally
rely on them. Also, fix the annoying "mon.-1" garbage debug output that
is confusing.
Dan Mick [Fri, 26 Apr 2013 07:04:13 +0000 (00:04 -0700)]
debian/rules: use multiline search to look for Build-Depends
When Build-Depends was split into multiple lines (in commit 8f5c665744e58d6d51a1e86de55c1399f51cc1c3), the grep for
libgoogle-perftools-dev broke. Replace grep with perl for multiline
matching.
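
To illustrate why a line-oriented search breaks once the field is split,
here is a small sketch in Python rather than the perl used by the commit.
The sample control text and both helper functions are made up for the
example; the only fact relied on is that continuation lines in
debian/control begin with a space or tab.

```python
import re

# Hypothetical debian/control excerpt with a multi-line Build-Depends.
CONTROL = """\
Source: ceph
Build-Depends: debhelper,
 libboost-dev,
 libgoogle-perftools-dev,
 uuid-dev
Standards-Version: 3.9.3
"""


def single_line_check(control_text, package):
    # Line-oriented check: only sees the first line of the field.
    return any(
        line.startswith("Build-Depends:") and package in line
        for line in control_text.splitlines()
    )


def multiline_check(control_text, package):
    # Match the field line plus every continuation line that follows it.
    field = re.search(
        r"^Build-Depends:.*(?:\n[ \t].*)*", control_text, re.MULTILINE
    )
    return field is not None and package in field.group(0)


line_result = single_line_check(CONTROL, "libgoogle-perftools-dev")
multi_result = multiline_check(CONTROL, "libgoogle-perftools-dev")
```

With this sample, the line-oriented check misses the package while the
multiline match finds it.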
Sam Lang [Thu, 25 Apr 2013 23:52:06 +0000 (18:52 -0500)]
client: don't embed cap releases in clientreplay
If the client is sending replay requests, avoid sending embedded caps,
since the mds already has the client's caps from the reconnect.
This matches the behavior of the kernel client.
Fixes #4742.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 25 Apr 2013 23:47:15 +0000 (16:47 -0700)]
mon: do not forward other mon's requests to other mons
The request forwarding infrastructure is there for client requests.
However, we (ab)use it for mons sending MLog messages: LogClient sends an
MLog message to itself, and that is either handled locally (if leader) or
forwarded to the leader.
If that races with an election, we were forwarding an MLog from another mon
to the leader. This is not necessary; the original MLog sender will resend
the request on election_finish() to the latest leader.
The fix is to adjust forward_request_leader() to only forward messages from
a mon if that mon is itself.
This was reproduced while testing the fix for #4748.
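
The rule in the fix can be written as a small predicate. This is only a
sketch: should_forward_to_leader and its parameters are hypothetical and do
not reflect the actual signature of forward_request_leader().

```python
def should_forward_to_leader(source_type, source_rank, my_rank):
    # Requests from non-monitor sources (e.g. clients) remain eligible
    # for forwarding, as before.
    if source_type != "mon":
        return True
    # A message from a monitor is forwarded only if that monitor is us;
    # another mon's MLog will be resent by its sender after the election.
    return source_rank == my_rank


from_client = should_forward_to_leader("client", None, my_rank=2)
from_self = should_forward_to_leader("mon", 2, my_rank=2)
from_other_mon = should_forward_to_leader("mon", 1, my_rank=2)
```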
Samuel Just [Thu, 25 Apr 2013 21:08:57 +0000 (14:08 -0700)]
PG: clear want_acting when we leave Primary
This is somewhat annoying actually. Intuitively we want to
clear_primary_state when we leave primary, but when we restart
peering due to a change in prior set status, we can't afford
to forget most of our peering state. want_acting, on the
other hand, should never persist across peering attempts.
In fact, in the future, want_acting should be pulled into
the Primary state structure.
Fixes: #3904
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Thu, 25 Apr 2013 22:18:42 +0000 (15:18 -0700)]
mon: get own entity_inst_t via messenger, not monmap
There are intervals during bootstrap(*) during which we are part of the
monmap, but our name (mon->name) does not match the monmap's. This means
that calling monmap->get_inst(mon->name) is not a safe way to get our own
entity_inst_t.
Instead, use messenger->get_myinst(). This includes our addr (obviously)
and an up-to-date entity_name_t, too: in bootstrap we adjust the messenger
name at the same time as mon->rank, based on the contents of the monmap.
monmap->get_inst(mon->rank) would work too.
* During mkfs, the monmap may have noname-foo instead of the name if it was
generated from the mon_host lines or dns or whatever by
MonMap::build_initial(). This was the case for #4811.
Fixes: #4811
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Li Wang [Thu, 25 Apr 2013 15:36:56 +0000 (23:36 +0800)]
radosgw: receiving unexpected error code while accessing a non-existing object by authorized non-owner user
This patch fixes a bug in the radosgw Swift compatibility code:
if an authorized user who is not the owner accesses a non-existing
object in a container, they will receive an unexpected error code.
To reproduce this bug, do the following steps:
1 User1 creates a container, and grants the read/write permission to user2
Sage Weil [Thu, 25 Apr 2013 18:13:33 +0000 (11:13 -0700)]
init-ceph: use remote config when starting daemons on remote nodes (-a)
If you use -a to start a remote daemon, assume the remote config is present
instead of pushing the local config. This makes more sense and simplifies
things.
Note that this means that -a in concert with -c foo means that foo must
also be present on the remote node in the same path. That, however, is a
use case that I don't particularly care about right now. :)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
David Zafman [Tue, 23 Apr 2013 00:06:52 +0000 (17:06 -0700)]
scrub clears inconsistent flag set by deep scrub
Add new num_deep_scrub_errors and num_shallow_scrub_errors to object_stat_sum_t
Show deep-scrub error count when outputting regular scrub errors
Set invalid size in case of a stat error which sets read_error
For now do deep-scrub after repair (see #4783)
fixes: #4778
Signed-off-by: David Zafman <david.zafman@inktank.com>
ObjectCacher: remove all buffers from a non-existent object
Once we're sure an object doesn't exist, we retry all the waiters in
order, and they return -ENOENT immediately. If there were a bunch of
BufferHeads waiting for data (rx state), they would be left behind
while the reads that triggered them were complete from the cache
user's perspective. These extra rx BufferHeads would pin the object in
the lru, so they wouldn't be removed by release_set(). This meant that
the assert during shutdown of the cache would be triggered.
To fix this, remove any BufferHeads in this state immediately when we
find out the object doesn't exist. Use the same condition as readx for
determining whether this is safe - if we got -ENOENT and all
BufferHeads for the object are clean or rx.
mon: when electing, be sure acked leaders have new enough stores to lead
In general anybody participating in an election should be new enough to
lead thanks to the bootstrap process, but we've observed situations in
which a monitor is leader but gets so busy that it gets booted out
without noticing for a while, then processes the election messages
which were spawned, responds to them, and the other monitors kick those
up to a new election epoch. Then the old and behind monitor gets
elected as the new leader, which does bad things to our sync.
To deal with this, add the paxos first and last committed versions
to the MMonElection messages, and consider those values when deciding
whether to defer to a peer. Only defer to them if their newest value
is newer than our oldest, but also *do* defer to them if their oldest
value is newer than our newest even if we out-rank them otherwise.
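
The deferral rule above can be sketched as follows. This is an
illustration under stated assumptions, not the Elector code: should_defer
and its parameter names are hypothetical, and the sketch assumes that a
lower rank number wins when the stores do not decide the outcome.

```python
def should_defer(my_rank, my_first, my_last, peer_rank, peer_first, peer_last):
    # first/last are the paxos first and last committed versions.
    # The peer's oldest version is newer than our newest: defer even
    # if we would otherwise out-rank the peer.
    if peer_first > my_last:
        return True
    # The peer's newest version is not newer than our oldest: its store
    # is too far behind to lead, so never defer.
    if peer_last <= my_first:
        return False
    # Stores overlap: fall back to rank (assumption: lower rank wins).
    return peer_rank < my_rank


# Peer is entirely ahead of us, although we have the better rank.
ahead = should_defer(0, 100, 200, 1, 250, 300)
# Peer has the better rank but a store that is far behind ours.
behind = should_defer(1, 100, 200, 0, 10, 50)
# Overlapping stores: rank decides.
overlap_better_rank = should_defer(1, 100, 200, 0, 150, 220)
overlap_worse_rank = should_defer(0, 100, 200, 1, 150, 220)
```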
mon: be more careful about making sure we're up-to-date on sync check
We were looking at our own paxos_max_join_drift and using that to
calculate whether we were new enough to join without syncing, but
if those numbers don't match across monitors they might have trimmed. Use
the number they provide us as their first version and compare to that
as well.
When calculating the [a,b] interval over which a given clone is valid, do
not assume that b == the clone id; that is *not* true when the original
end snap has been deleted/trimmed.
While we are here, make the code a bit cleaner to read.
Fixes: #4785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 24 Apr 2013 19:20:17 +0000 (12:20 -0700)]
PG: call check_recovery_sources in remove_down_peer_info
If we transition out of peering due to affected
prior set, we won't trigger start_peering_interval
and check_recovery_sources won't get called. This
will leave an entry in missing_loc_sources without
a matching missing set. We always want to
check_recovery_sources with remove_down_peer_info.
Fixes: #4805
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 24 Apr 2013 17:13:40 +0000 (10:13 -0700)]
mkcephfs: give mon. key 'allow *' mon caps
This will ease the transition from mkcephfs to ceph-deploy by allowing
ceph-create-keys to use the mon. keyring file in $mon_data and get the
caps it needs.
Fixes: #4756
Signed-off-by: Sage Weil <sage@inktank.com>
Fixes: #4759
Add a new request param 'stats' for the swift list containers
request. If set to 'false' it disables stats retrieval, which
makes it go faster. Also, don't dump stats if format is plain,
as they're not going to be dumped.