Dan Mick [Fri, 26 Apr 2013 07:04:13 +0000 (00:04 -0700)]
debian/rules: use multiline search to look for Build-Depends
When Build-Depends was split into multiple lines (in commit 8f5c665744e58d6d51a1e86de55c1399f51cc1c3), the grep for
libgoogle-perftools-dev broke. Replace grep with perl for multiline
matching.
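The failure mode can be illustrated with a minimal Python sketch. The actual fix uses perl in debian/rules; the control-file excerpt and both function names below are hypothetical. A line-at-a-time check only sees the first line of the field, while a multiline pattern also covers the indented continuation lines.

```python
import re

# Hypothetical debian/control excerpt with Build-Depends split over several lines.
CONTROL = """Source: ceph
Build-Depends: debhelper (>= 6.0.7~),
 autoconf,
 libgoogle-perftools-dev [i386 amd64],
 libtool
Standards-Version: 3.9.3
"""

def build_depends_mentions(control_text, package):
    """True if `package` appears in the Build-Depends field,
    including its indented continuation lines."""
    # First line of the field, then any following lines that start with whitespace.
    m = re.search(r"^Build-Depends:.*(?:\n[ \t].*)*", control_text, re.MULTILINE)
    return bool(m) and package in m.group(0)

def line_grep_mentions(control_text, package):
    """Line-at-a-time check, analogous to a single-line grep for the field."""
    return any(line.startswith("Build-Depends:") and package in line
               for line in control_text.splitlines())
```

In this sketch the multiline check finds libgoogle-perftools-dev and the line-at-a-time check does not, which is the kind of breakage described above.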
Sam Lang [Thu, 25 Apr 2013 23:52:06 +0000 (18:52 -0500)]
client: don't embed cap releases in clientreplay
If the client is sending replay requests, avoid sending embedded caps,
since the mds already has the client's caps from the reconnect.
This matches the behavior of the kernel client.
Fixes #4742. Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Thu, 25 Apr 2013 21:08:57 +0000 (14:08 -0700)]
PG: clear want_acting when we leave Primary
This is somewhat annoying actually. Intuitively we want to
clear_primary_state when we leave primary, but when we restart
peering due to a change in prior set status, we can't afford
to forget most of our peering state. want_acting, on the
other hand, should never persist across peering attempts.
In fact, in the future, want_acting should be pulled into
the Primary state structure.
Fixes: #3904 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Thu, 25 Apr 2013 22:18:42 +0000 (15:18 -0700)]
mon: get own entity_inst_t via messenger, not monmap
There are intervals during bootstrap(*) during which we are part of the
monmap, but our name (mon->name) does not match the monmap's. This means
that calling monmap->get_inst(mon->name) is not a safe way to get our own
entity_inst_t.
Instead, use messenger->get_myinst(). This includes our addr (obviously)
and an up-to-date entity_name_t, too: in bootstrap we adjust the messenger
name at the same time as mon->rank, based on the contents of the monmap.
monmap->get_inst(mon->rank) would work too.
* During mkfs, the monmap may have noname-foo instead of the name if it was
generated from the mon_host lines or dns or whatever by
MonMap::build_initial(). This was the case for #4811.
Fixes: #4811 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Li Wang [Thu, 25 Apr 2013 15:36:56 +0000 (23:36 +0800)]
radosgw: receiving unexpected error code while accessing a non-existing object by authorized non-owner user
This patch fixes a bug in the radosgw Swift compatibility code:
if an authorized user who is not the owner accesses a non-existing
object in a container, they will receive an unexpected error code.
To reproduce this bug, do the following steps:
1 User1 creates a container, and grants the read/write permission to user2
Sage Weil [Thu, 25 Apr 2013 18:13:33 +0000 (11:13 -0700)]
init-ceph: use remote config when starting daemons on remote nodes (-a)
If you use -a to start a remote daemon, assume the remote config is present
instead of pushing the local config. This makes more sense and simplifies
things.
Note that this means that -a in concert with -c foo means that foo must
also be present on the remote node in the same path. That, however, is a
use case that I don't particularly care about right now. :)
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
David Zafman [Tue, 23 Apr 2013 00:06:52 +0000 (17:06 -0700)]
scrub clears inconsistent flag set by deep scrub
Add new num_deep_scrub_errors and num_shallow_scrub_errors to object_stat_sum_t
Show deep-scrub error count when outputting regular scrub errors
Set invalid size in case of a stat error which sets read_error
For now do deep-scrub after repair (see #4783)
fixes: #4778 Signed-off-by: David Zafman <david.zafman@inktank.com>
ObjectCacher: remove all buffers from a non-existent object
Once we're sure an object doesn't exist, we retry all the waiters in
order, and they return -ENOENT immediately. If there were a bunch of
BufferHeads waiting for data (rx state), they would be left behind
while the reads that triggered them were complete from the cache
user's perspective. These extra rx BufferHeads would pin the object in
the lru, so they wouldn't be removed by release_set(). This meant that
the assert during shutdown of the cache would be triggered.
To fix this, remove any BufferHeads in this state immediately when we
find out the object doesn't exist. Use the same condition as readx for
determining whether this is safe - if we got -ENOENT and all
BufferHeads for the object are clean or rx.
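The safety condition can be sketched in Python as a toy model. This is not the ObjectCacher code; the class names, the string states, and `purge_if_nonexistent` are illustrative assumptions.

```python
import errno
from dataclasses import dataclass, field

@dataclass
class BufferHead:
    state: str  # toy states: "clean", "rx", "dirty", "tx"

@dataclass
class CachedObject:
    buffers: list = field(default_factory=list)

def purge_if_nonexistent(obj, read_result):
    """Drop every buffer once the object is known not to exist,
    but only if all of its buffers are clean or still waiting for data (rx)."""
    safe = all(bh.state in ("clean", "rx") for bh in obj.buffers)
    if read_result == -errno.ENOENT and safe:
        obj.buffers.clear()  # nothing is left behind to pin the object in the lru
        return True
    return False
```

An object holding a dirty buffer is left untouched in this model, since discarding it would lose data that has not been written back.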
mon: when electing, be sure acked leaders have new enough stores to lead
In general anybody participating in an election should be new enough to
lead thanks to the bootstrap process, but we've observed situations in
which a monitor is leader but gets so busy that it gets booted out
without noticing for a while, then processes the election messages
which were spawned, responds to them, and the other monitors kick those
up to a new election epoch. Then the old and behind monitor gets
elected as the new leader, which does bad things to our sync.
To deal with this, add the paxos first and last committed versions
to the MMonElection messages, and consider those values when deciding
whether to defer to a peer. Only defer to them if their newest value
is newer than our oldest, but also *do* defer to them if their oldest
value is newer than our newest even if we out-rank them otherwise.
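One possible reading of this rule, as a Python sketch. The function and parameter names are hypothetical, and it assumes that a numerically lower rank out-ranks a higher one; it is not the monitor's election code.

```python
def should_defer(my_rank, my_first, my_last, peer_rank, peer_first, peer_last):
    """Decide whether to defer to a peer, given each side's paxos
    first and last committed versions."""
    # The peer's oldest value is newer than our newest: defer even if we out-rank it.
    if peer_first > my_last:
        return True
    # Otherwise defer only to a peer that out-ranks us (assumed: lower rank wins)
    # and whose newest value is newer than our oldest.
    return peer_rank < my_rank and peer_last > my_first
```

Under this reading, a peer that out-ranks us but whose store ends before ours begins is not deferred to, which is the stale-leader case described above.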
mon: be more careful about making sure we're up-to-date on sync check
We were looking at our own paxos_max_join_drift and using that to
calculate whether we were new enough to join without syncing, but
if those numbers don't match across monitors they might have trimmed. Use
the number they provide us as their first version and compare to that
as well.
When calculating the [a,b] interval over which a given clone is valid, do
not assume that b == the clone id; that is *not* true when the original
end snap has been deleted/trimmed.
While we are here, make the code a bit cleaner to read.
Fixes: #4785 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 24 Apr 2013 19:20:17 +0000 (12:20 -0700)]
PG: call check_recovery_sources in remove_down_peer_info
If we transition out of peering due to affected
prior set, we won't trigger start_peering_interval
and check_recovery_sources won't get called. This
will leave an entry in missing_loc_sources without
a matching missing set. We always want to
check_recovery_sources with remove_down_peer_info.
Fixes: #4805
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 24 Apr 2013 17:13:40 +0000 (10:13 -0700)]
mkcephfs: give mon. key 'allow *' mon caps
This will ease the transition from mkcephfs to ceph-deploy by allowing
ceph-create-keys to use the mon. keyring file in $mon_data and get the
caps it needs.
Fixes: #4756 Signed-off-by: Sage Weil <sage@inktank.com>
Fixes: #4759
Add a new request param 'stats' for the swift list containers
request. If set to 'false' it disables stats retrieval, which
makes it go faster. Also, don't dump stats if format is plain,
as they're not going to be dumped.
Sage Weil [Wed, 24 Apr 2013 00:16:31 +0000 (17:16 -0700)]
mon: revert part of PaxosService::is_readable() change
In 98e23980f4ab7ba289303f72da06721c84767293 is_readable() was changed to
call is_active(), but that has a check for is_bootstrapping(), so there is
a semantic change.
As a result, we may fail PaxosService::is_readable() (due to bootstrapping)
and then try to call Paxos::wait_for_readable(). That will assert that
Paxos::is_readable() is false, but it will be true and we will crash.
Revert that part of the change, since the semantic change was not
intentional.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Tue, 23 Apr 2013 21:58:55 +0000 (14:58 -0700)]
librbd: add read_iterate2 call with fixed argument type
The existing read_iterate takes a size_t for the length, which is only 4GB
on 32-bit machines. Instead, take a uint64_t length for the new
read_iterate2().
Return 0 instead of the number of bytes read; this makes the user-facing
API a bit simpler.
Fixes: #4665 Signed-off-by: Sage Weil <sage@inktank.com>
keep bytes return from internal method
Sage Weil [Tue, 23 Apr 2013 22:44:42 +0000 (15:44 -0700)]
librbd: implement read not in terms of read_iterate
The read() method returns the bytes read, trimmed to the end of the image;
use the other read() variant to do this (which uses aio_read()) instead of
read_iterate().
Sage Weil [Tue, 23 Apr 2013 21:06:41 +0000 (14:06 -0700)]
mon: drop forwarded requests after an election
On each election, we resend routed requests to the new leader (or
requeue for ourselves). Therefore, if we receive a forwarded request,
we should drop it on the floor if there is a new election. Add a field
in the PaxosServiceMessage struct to track which election epoch we
received the request in, and drop it in PaxosService::dispatch() if
that is in the past.
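A minimal Python sketch of the epoch check. The field name `rx_election_epoch` and the `dispatch` signature are assumptions made for illustration, not the PaxosServiceMessage layout.

```python
from dataclasses import dataclass

@dataclass
class ForwardedRequest:
    payload: str
    rx_election_epoch: int  # hypothetical name: epoch in which the request was received

def dispatch(request, current_election_epoch, handle):
    """Drop requests received in a past election epoch; handle the rest."""
    if request.rx_election_epoch < current_election_epoch:
        return False  # dropped: routed requests are resent after each election
    handle(request)
    return True
```

Dropping is safe in this model only because the forwarding side resends after every election, as the message above states.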
Sage Weil [Tue, 23 Apr 2013 20:45:59 +0000 (13:45 -0700)]
mon: requeue routed_requests for self if elected leader
If we have requests that we have forwarded, and are elected leader,
requeue those requests for ourselves and queue them normally and clear out
the routed_requests map.
Samuel Just [Tue, 23 Apr 2013 19:08:14 +0000 (12:08 -0700)]
test_filejournal: adjust corrupt entry tests to force header write
The journal no longer assumes corruption if it finds a valid entry
after an invalid entry. Instead, these tests will exercise the
corruption detection via the header committed_up_to member.
Fixes: #4792 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Fixes: #4760
Instead of retrieving the entire list of buckets in one
chunk, streamline it. This makes it so that if the request
takes too long, client isn't going to timeout before getting
any data.
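The idea can be sketched in Python with a generator. `fetch_chunk(marker, limit)` is a hypothetical callback that returns up to `limit` bucket names sorted after `marker`; it does not correspond to a radosgw function.

```python
def stream_buckets(fetch_chunk, chunk_size=1000):
    """Yield bucket names chunk by chunk instead of building the full list first."""
    marker = ""
    while True:
        chunk = fetch_chunk(marker, chunk_size)
        yield from chunk            # the client starts receiving data right away
        if len(chunk) < chunk_size:
            return                  # a short chunk means there is nothing left
        marker = chunk[-1]          # resume after the last name already sent
```

The total work is the same as before; the difference is that the first entries go out after one chunk rather than after the whole listing has been gathered.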
Sage Weil [Tue, 23 Apr 2013 17:00:38 +0000 (10:00 -0700)]
init-ceph: fix (and simplify) pushing ceph.conf to remote unique name
The old code would only do the push once per remote node (due to the
list in $pushed_to) but would reset $unique on each attempt. This would
break if a remote host was processed twice.
Fix by just skipping the $pushed_to optimization entirely.
Fixes: #4794 Reported-by: Andreas Friedrich <andreas.friedrich@ts.fujitsu.com> Signed-off-by: Sage Weil <sage@inktank.com>
Gary Lowell [Thu, 11 Apr 2013 16:42:13 +0000 (09:42 -0700)]
ceph-disk: OSD hotplug fixes for Centos
Two fixes for Centos 6.3 and other systems with udev versions
prior to 172. The disk persistent name using the GPT UUID does
not exist, so use the by_path persistent name instead for the
journal symlink.
The gpt label fields are not available for use in udev rules. Add
ceph-disk-udev wrapper script that extracts the partition
type guid from the label and calls ceph-disk-activate if it is
a ceph guid type. (Bug #4632)
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
mon: PaxosService: add request_proposal() to perform cross-proposals
Instead of allowing services to directly use 'propose_pending()' on
other services, we instead add two new functions:
- request_proposal() to request 'this' service to propose its
pending value; and
- request_proposal(PaxosService *other) so that 'this' service
can request a proposal to 'other'
These functions should allow us to enforce a greater set of
constraints at time of a cross-proposal, either by making sure a
service will (e.g.) hold off its own proposals until said proposal
is performed, or even that the other service will enforce a tighter
set of constraints that wouldn't otherwise be enforced by using
'propose_pending()' directly.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Mon, 22 Apr 2013 22:01:09 +0000 (15:01 -0700)]
mon: commit LogSummary on every message
This moves our version pointer up so that we don't re-log (by re-consuming)
log messages to /var/log/ceph/ceph.log on ceph-mon restart. OTOH, it means
we rewrite the summary of the last 50 messages, but we consider that to be
relatively cheap (and something we *always* did prior for bobtail and
earlier anyway).
ceph-mon: Attempt to obtain monmap from several possible sources
In order of interest/priority:
- our latest monmap version
- a backup monmap version created during sync start, if the store
appears to be in a post-aborted sync state
- a mkfs monmap version
If none of these are found, we should go ahead and try to build a
monmap from ceph.conf to join an existing cluster.
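The lookup order above amounts to a first-match search, sketched here in Python. All names are hypothetical, and `None` stands for a source that is not present in the store.

```python
def pick_monmap(latest, sync_backup, mkfs, in_aborted_sync):
    """Return (source_name, monmap) for the first usable source, or None."""
    candidates = [
        ("latest", latest),
        # The backup is considered only if the store looks like a post-aborted sync.
        ("sync_backup", sync_backup if in_aborted_sync else None),
        ("mkfs", mkfs),
    ]
    for name, monmap in candidates:
        if monmap is not None:
            return name, monmap
    return None  # the caller would then try to build a monmap from ceph.conf
```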
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: Monitor: backup monmap prior to starting a store sync
If by fate we end up attempting a store sync after failing at
least one before, we might not have a monmap to read from the
store to backup. Therefore, in that case, we shall backup the
current monmap being used by the monitor.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
David Zafman [Wed, 20 Mar 2013 06:12:35 +0000 (23:12 -0700)]
tools/ceph-filestore-dump: Error messages lost because stderr is closed
Use cout instead of cerr for command errors
Use cerr for debug mode because stderr is available
Output map_epoch in debug mode
Fix a message and only for debug mode
Signed-off-by: David Zafman <david.zafman@inktank.com>
With OSD sharing data and journal, the previous code created the
journal partition from the end of the device. A uint32_t is
used in sgdisk to get the last sector, with large HD, uint32_t
is too small.
The journal partition will be created backwards from
a sector in the middle of the disk, leaving space before
and after it. The data partition will use whichever of
these spaces is greater. The remaining will not be used.
This patch creates the journal partition from the start as a workaround.
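The wrap-around can be checked with a little arithmetic, sketched in Python. The disk size and the 512-byte sector size are assumed values for illustration; the sketch models only the uint32_t truncation, not sgdisk itself.

```python
UINT32_MAX = 2**32 - 1
SECTOR_BYTES = 512  # assumed logical sector size

def as_uint32(value):
    """Model storing a sector number in a uint32_t (wraps modulo 2**32)."""
    return value & UINT32_MAX

# Hypothetical 3 TB disk: the last sector number no longer fits in 32 bits.
last_sector = 5_860_533_167
wrapped = as_uint32(last_sector)  # lands well inside the disk, not at its end
```

With 512-byte sectors, sector numbers stop fitting in a uint32_t once the device exceeds 2 TiB (2**32 sectors), so the computed "last sector" falls somewhere in the middle of the disk.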