git.apps.os.sepia.ceph.com Git - ceph.git/log

Yehuda Sadeh [Tue, 18 Sep 2012 20:45:27 +0000 (13:45 -0700)]

cls_rgw: if stats drop below zero, set them to zero

This complements the fix for #3127. It is only a band-aid
solution for argonaut; the real solution fixes the original
issue that made this possible.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
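
A minimal sketch of the clamping described above, written in Python for illustration rather than taken from the cls_rgw C++ code. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    # Hypothetical counter names, for illustration only.
    num_entries: int = 0
    total_size: int = 0


def subtract_clamped(stats: CategoryStats, entries: int, size: int) -> None:
    """Subtract from the counters, never letting them drop below zero."""
    stats.num_entries = max(0, stats.num_entries - entries)
    stats.total_size = max(0, stats.total_size - size)


stats = CategoryStats(num_entries=1, total_size=100)
subtract_clamped(stats, 1, 100)
subtract_clamped(stats, 1, 100)  # a repeated subtraction would otherwise go negative
```

Clamping hides the accounting error rather than preventing it, which matches the commit's own description of this as a stopgap.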

Yehuda Sadeh [Wed, 12 Sep 2012 23:41:17 +0000 (16:41 -0700)]

cls_rgw: change scoping of suggested changes vars

Fixes: #3127
Bad variable scoping meant that specific variables
weren't reinitialized between suggested-changes iterations.
This specifically affected the case where a single
change contained an update followed by a remove, and the
remove was on a non-existent key (e.g., one already
removed earlier). We ended up re-subtracting the
object stats, as the entry wasn't reset between
the iterations (and we didn't read it because the
key didn't exist).

backport:argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
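
The scoping problem can be sketched as follows. This is a Python illustration under assumed data shapes (a dict index and a single running total), not the cls_rgw code. The point is that the per-change entry is reset at the top of every iteration.

```python
def apply_changes(index, changes, stats):
    """Apply (op, key, size) changes; 'index' maps key -> size and
    'stats' holds a running total. All names here are hypothetical."""
    for op, key, size in changes:
        # Reset on every iteration: this is the scoping fix. If 'current'
        # lived outside the loop, it would keep the value read for the
        # previous change whenever the key does not exist.
        current = None
        if key in index:
            current = index[key]
        if op == "update":
            if current is not None:
                stats["total_size"] -= current
            index[key] = size
            stats["total_size"] += size
        elif op == "remove" and current is not None:
            stats["total_size"] -= current
            del index[key]
    return stats
```

With an update of an existing key followed by a remove of a missing key, the remove is a no-op here. With a stale `current` it would subtract the old size a second time.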

Sage Weil [Tue, 4 Sep 2012 18:29:21 +0000 (11:29 -0700)]

objecter: fix osdmap wait

When we get a pool_op_reply, we find out which osdmap we need to wait for.
The wait_for_new_map() code was feeding that epoch into
maybe_request_map(), which was feeding it to the monitor with the subscribe
request. However, that epoch is the *start* epoch, not what we want. Fix
this code to always subscribe to what we have (+1), and ensure we keep
asking for more until we catch up to what we know we should eventually
get.

Bug: #3075
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit e09b26555c6132ffce08b565780a39e4177cbc1c)
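
A rough Python sketch of the corrected subscription logic, under the assumption that monitor replies can be modelled as a sequence of "newest epoch received" values. The function and parameter names are hypothetical and do not come from the Objecter code.

```python
def epochs_to_request(have_epoch, want_epoch, replies):
    """Return the start epochs we would subscribe with until caught up.

    'replies' stands in for monitor responses: each value is the newest
    osdmap epoch we hold after that response.
    """
    requested = []
    for newest in replies:
        if have_epoch >= want_epoch:
            break                          # caught up, stop asking
        requested.append(have_epoch + 1)   # always start from what we have (+1)
        have_epoch = max(have_epoch, newest)
    return requested
```

For example, holding epoch 5 and needing epoch 9, with replies that bring us to 7 and then 9, the requests start at 6 and then 8. The target epoch is never sent as the start.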

Sage Weil [Mon, 27 Aug 2012 14:38:34 +0000 (07:38 -0700)]

objecter: send queued requests when we get first osdmap

If we get our first osdmap and already have requests queued, send them.

Backported from 8d1efd1b829ae50eab7f7f4c07da04e03fce7c45.

Fixes: #3050
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 22 Aug 2012 04:12:33 +0000 (21:12 -0700)]

objecter: use ordered map<> for tracking tids to preserve order on resend

We are using a hash_map<> to map tids to Op*'s.  In handle_osd_map(),
we will recalc_op_target() on each Op in a random (hash) order.  These
will get put in a temp map<tid,Op*> to ensure they are resent in the
correct order, but their order on the session->ops list will be random.

Then later, if we reset an OSD connection, we will resend everything for
that session in ops order, which is incorrect.

Fix this by explicitly reordering the requests to resend in
kick_requests(), much like we do in handle_osd_map().  This lets us
continue to use a hash_map<>, which is faster for reasonable numbers of
requests.  A simpler but slower fix would be to just use map<> instead.

This is one of many bugs contributing to #2947.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 1113a6c56739a56871f01fa13da881dab36a32c4)
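
The reordering idea can be illustrated in Python, assuming each op is a dict with a "tid" key (a hypothetical shape, not the Objecter's Op struct): whatever order the session list is in, the ops are sorted by tid before being resent.

```python
def ops_to_resend(session_ops):
    """Return a session's ops sorted by tid, regardless of list order."""
    by_tid = {op["tid"]: op for op in session_ops}
    return [by_tid[tid] for tid in sorted(by_tid)]
```

This keeps the fast unordered container for lookups and pays the sorting cost only on the resend path, which is the trade-off the message describes.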

Dan Mick [Mon, 20 Aug 2012 22:02:57 +0000 (15:02 -0700)]

rbd: force all exiting paths through main()/return

This properly destroys objects. In the process, remove usage_exit();
also kill error-handling in set_conf_param (never relevant for rbd.cc,
and if you call it with both pointers NULL, well...)
Also switch to EXIT_FAILURE for consistency.

Backported from fed8aea662bf919f35a5a72e4e2a2a685af2b2ed.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Fixes: #2948

Josh Durgin [Tue, 18 Sep 2012 16:37:44 +0000 (09:37 -0700)]

rbd: only open the destination pool for import

Otherwise importing into another pool when the default pool, rbd,
doesn't exist results in an error trying to open the rbd pool.

Reported-by: Sébastien Han <han.sebastien@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Tommi Virtanen [Mon, 17 Sep 2012 15:55:14 +0000 (08:55 -0700)]

ceph-disk-activate, upstart: Use "initctl emit" to start OSDs.

This avoids an error if the daemon was running already, and is
already being done with the other services.

Signed-off-by: Tommi Virtanen <tv@inktank.com>

Josh Durgin [Sat, 15 Sep 2012 00:13:57 +0000 (17:13 -0700)]

rbd: make --pool/--image args easier to understand for import

There's no need to set the default pool in set_pool_image_name - this
is done later, in a way that doesn't ignore --pool if --dest-pool
is not specified.

This means --pool and --image can be used with import, just like
the rest of the commands. Without this change, --dest and --dest-pool
had to be used, and --pool would be silently ignored for rbd import.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Tommi Virtanen [Thu, 13 Sep 2012 21:06:04 +0000 (14:06 -0700)]

ceph-create-keys: Create a bootstrap-osd key too.

Signed-off-by: Tommi Virtanen <tv@inktank.com>

Tommi Virtanen [Thu, 13 Sep 2012 18:34:03 +0000 (11:34 -0700)]

ceph-create-keys: Refactor to share wait_for_quorum call.

Signed-off-by: Tommi Virtanen <tv@inktank.com>

Sage Weil [Wed, 12 Sep 2012 18:38:07 +0000 (11:38 -0700)]

objecter: fix skipped map handling

If we skip a map, we want to translate NO_ACTION to NEED_RESEND, but leave
POOL_DNE alone.

Backported from 2a3b7961c021b19a035f8a6cc4fc3cc90f88f367.

Signed-off-by: Sage Weil <sage@inktank.com>

Josh Durgin [Mon, 30 Jul 2012 22:19:29 +0000 (15:19 -0700)]

librbd, cls_rbd: close snapshot creation race with old format

If two clients created a snapshot at the same time, the one with the
higher snapshot id might be created first, so the lower snapshot id
would be added to the snapshot context and the snapshot seq would be
set to the lower one.

Instead of allowing this to happen, return -ESTALE if the snapshot id
is lower than the currently stored snapshot sequence number. On the
client side, get a new id and retry if this error is encountered.

Backport: argonaut
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
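
The check-and-retry protocol above can be sketched in Python. The `Header` class and `allocate_id` callback are hypothetical stand-ins for the cls_rbd method and the client's id allocation. Only the -ESTALE rule is taken from the message.

```python
import errno


class Header:
    """Toy stand-in for the image header holding the snapshot sequence."""

    def __init__(self, snap_seq=0):
        self.snap_seq = snap_seq
        self.snaps = []

    def snapshot_add(self, snap_id, name):
        # Reject ids older than the stored sequence instead of recording them.
        if snap_id < self.snap_seq:
            return -errno.ESTALE
        self.snaps.append((snap_id, name))
        self.snap_seq = snap_id
        return 0


def create_snapshot(header, name, allocate_id):
    """Client side: get a new id and retry while the server reports ESTALE."""
    while True:
        ret = header.snapshot_add(allocate_id(), name)
        if ret != -errno.ESTALE:
            return ret
```

The loser of the race gets a fresh, higher id on retry, so the stored sequence only ever moves forward.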

Tommi Virtanen [Tue, 11 Sep 2012 23:31:57 +0000 (16:31 -0700)]

upstart: Give everything a stop on stanza.

These are all tasks, and expected to exit somewhat quickly,
but e.g. ceph-create-keys has a loop where it waits for mon
to reach quorum, so it might still be in that loop when the
machine is shut down.

Tommi Virtanen [Tue, 11 Sep 2012 23:28:41 +0000 (16:28 -0700)]

upstart: Start mds,mon,radosgw after a reboot.

They had no "start on" stanzas, so they didn't get started earlier.

Tommi Virtanen [Tue, 11 Sep 2012 22:31:06 +0000 (15:31 -0700)]

upstart: Add ceph-create-keys.conf to package.

Signed-off-by: Tommi Virtanen <tv@inktank.com>

Sage Weil [Tue, 11 Sep 2012 21:50:53 +0000 (14:50 -0700)]

obsync: if OrdinaryCallingFormat fails, try SubdomainCallingFormat

This blindly tries the Subdomain calling format if the ordinary method
fails. In particular, this works around buckets that present a
PermanentRedirect message.

See bug #3128.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Matthew Wodrich <matthew.wodrich@dreamhost.com>
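
A sketch of the fallback pattern in Python, deliberately independent of any S3 library. The `connect` callable, the format labels and the `IOError` failure type are assumptions made for illustration and are not obsync's actual code.

```python
def open_bucket(name, connect, formats=("ordinary", "subdomain")):
    """Try each calling format in turn; re-raise the last error if all fail.

    'connect(name, fmt)' is a hypothetical callable that returns a bucket
    handle or raises IOError when the lookup fails.
    """
    last_error = None
    for fmt in formats:
        try:
            return connect(name, fmt)
        except IOError as err:
            last_error = err
    raise last_error
```

Because the second attempt is made blindly, a failure unrelated to the calling format is simply reported after one extra request.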

Sage Weil [Fri, 17 Aug 2012 23:04:20 +0000 (16:04 -0700)]

librbd: add test for discard of nonexistent objects

This verifies librbd properly handles ENOENT during discard.

Signed-off-by: Sage Weil <sage@inktank.com>

Josh Durgin [Mon, 10 Sep 2012 20:19:53 +0000 (13:19 -0700)]

librbd: ignore -ENOENT during discard

This is a backport of a3ad98a3eef062e9ed51dd2d1e58c593e12c9703

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Thu, 16 Aug 2012 01:42:56 +0000 (18:42 -0700)]

objectcacher: fix bh leak on discard

Fixes: #2950
Signed-off-by: Sage Weil <sage@inktank.com>

Tommi Virtanen [Thu, 30 Aug 2012 14:16:52 +0000 (10:16 -0400)]

upstart, ceph-create-keys: Make client.admin key generation automatic.

This should help simplify Chef etc deployments. Now (when using the
Upstart jobs), when a ceph-mon is started, ceph-create-admin-key is
triggered. If /etc/ceph/$cluster.client.admin.keyring already exists,
it does nothing; otherwise, it waits for ceph-mon to reach quorum, and
then does a "ceph auth get-or-create" to create the key, and writes it
atomically to disk.

The equivalent code can be removed from the Chef cookbook once this is
in.

Signed-off-by: Tommi Virtanen <tv@inktank.com>
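
The "do nothing if the keyring exists, otherwise write it atomically" behaviour can be sketched like this. It is a generic Python illustration, not the ceph-create-keys source. `fetch_key` is a hypothetical callable standing in for the wait-for-quorum and "ceph auth get-or-create" steps.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write 'data' to 'path' so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)   # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise


def ensure_key(path, fetch_key):
    """Return False if the keyring already exists; otherwise create it."""
    if os.path.exists(path):
        return False
    write_atomically(path, fetch_key())
    return True
```

Writing to a temporary file in the same directory and renaming it means a crash mid-write leaves either no keyring or a complete one.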

Tommi Virtanen [Thu, 30 Aug 2012 14:21:29 +0000 (10:21 -0400)]

config: Add a per-name default keyring to front of keyring search path.

This lets us have e.g. /etc/ceph/ceph.client.admin.keyring that is
owned by root:admin and mode u=rw,g=r,o= without making every non-root
run of the command line tools complain and fail.

This is what the Chef cookbook has been doing for a while already.

Signed-off-by: Tommi Virtanen <tv@inktank.com>
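
As an illustration of "front of the search path", here is a small Python sketch. Only the per-name entry is taken from the message above. The later entries are assumptions about what the rest of the default list might look like.

```python
def keyring_search_path(cluster, name):
    """Return candidate keyring paths, most specific first."""
    return [
        "/etc/ceph/%s.%s.keyring" % (cluster, name),  # per-name default, tried first
        "/etc/ceph/%s.keyring" % cluster,             # assumed later entry
        "/etc/ceph/keyring",                          # assumed later entry
    ]
```

For cluster "ceph" and name "client.admin", the first candidate is /etc/ceph/ceph.client.admin.keyring, the path mentioned in the message.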

Tommi Virtanen [Thu, 30 Aug 2012 14:11:09 +0000 (10:11 -0400)]

upstart: Make instance jobs export their cluster and id variables.

This allows other jobs listening to Upstart "started ceph-mon" events
to see what instance started.

Signed-off-by: Tommi Virtanen <tv@inktank.com>

Tommi Virtanen [Thu, 12 Jul 2012 17:47:29 +0000 (10:47 -0700)]

upstart: Make ceph-osd always set the crush location.

This used to be conditional on config having osd_crush_location set,
but with that, minimal configuration left the OSD completely out of
the crush map, and prevented the OSD from starting properly.

Note: Ceph does not currently let this mechanism automatically move
hosts to another location in the CRUSH hierarchy. This means if you
let this run with defaults, setting osd_crush_location later will not
take effect. Set up your config file (or Chef environment) fully
before starting the OSDs the first time.

Signed-off-by: Tommi Virtanen <tv@inktank.com>

Tommi Virtanen [Tue, 3 Jul 2012 22:24:26 +0000 (15:24 -0700)]

ceph-disk-prepare: Partition and format OSD data disks automatically.

Uses gdisk, as it seems to be the only tool that can automate GPT uuid
changes. Needs to run as root.

Adds Recommends: gdisk to ceph.deb.

Closes: #2547
Signed-off-by: Tommi Virtanen <tv@inktank.com>

Tommi Virtanen [Tue, 3 Jul 2012 16:22:28 +0000 (09:22 -0700)]

ceph-disk-prepare: Take fsid from config file.

Closes: #2546.
Signed-off-by: Tommi Virtanen <tv@inktank.com>

Tommi Virtanen [Mon, 25 Jun 2012 22:14:33 +0000 (15:14 -0700)]

upstart: fix regex

Signed-off-by: Tommi Virtanen <tv@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>

Yehuda Sadeh [Tue, 28 Aug 2012 23:17:21 +0000 (16:17 -0700)]

rgw: clear usage map before reading usage

Fixes: #3057
Since we read usage in chunks we need to clear the
usage map before reading the next chunk, otherwise
we're going to aggregate the old data as well.

Backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
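
The chunked-read bug can be illustrated in Python. The `read_chunk` callable is a hypothetical reader that fills a dict and returns whether more data remains. It is not the rgw API.

```python
def aggregate_usage(read_chunk):
    """Sum usage per user across chunks.

    'read_chunk(out)' adds one chunk of {user: bytes} to 'out' and
    returns True while more chunks remain.
    """
    totals = {}
    usage = {}
    more = True
    while more:
        usage.clear()              # the fix: drop the previous chunk first
        more = read_chunk(usage)
        for user, nbytes in usage.items():
            totals[user] = totals.get(user, 0) + nbytes
    return totals
```

Without the `clear()` call, every entry from an earlier chunk would be counted again each time a later chunk is folded in.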

Gary Lowell [Thu, 23 Aug 2012 18:48:50 +0000 (11:48 -0700)]

Don't package crush header files.

Yehuda Sadeh [Sat, 18 Aug 2012 00:34:23 +0000 (17:34 -0700)]

rgw: dump content_range using 64 bit formatters

Fixes: #2961
Also make sure that size is 64 bit.

backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Sage Weil [Tue, 21 Aug 2012 17:58:38 +0000 (10:58 -0700)]

Revert "rgw: dump content_range using 64 bit formatters"

This reverts commit faf9fa5744b459abc2eda829a48a4e07b9c97a08.

Yehuda Sadeh [Sat, 18 Aug 2012 00:34:23 +0000 (17:34 -0700)]

rgw: dump content_range using 64 bit formatters

Fixes: #2961
Also make sure that size is 64 bit.

backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

Matthew Wodrich [Wed, 1 Aug 2012 02:13:03 +0000 (19:13 -0700)]

obsync: add missing package specifier to format_exc

Fixes: #2873
Signed-off-by: Matthew Wodrich <matthew.wodrich@dreamhost.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>

Danny Kukawka [Thu, 16 Aug 2012 10:56:58 +0000 (12:56 +0200)]

fix keyring generation for mds and osd

Fix config keys for OSD/MDS data dirs. As in documentation and other
places of the scripts the keys are 'osd data'/'mds data' and not
'osd_data'

In the case of the MDS: if 'mds data' doesn't exist, create it.

Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>

Danny Kukawka [Thu, 16 Aug 2012 10:56:32 +0000 (12:56 +0200)]

fix ceph osd create help

Change ceph osd create <osd-id> to ceph osd create <uuid>, since this
is what the command is really doing.

Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>

Sage Weil [Tue, 10 Jul 2012 00:24:19 +0000 (17:24 -0700)]

mon: simplify logmonitor check_subs; less noise

* simple helper to translate name to id
* verify sub type is valid in caller
* assert sub type is valid in method
* simplify iterator usage

Among other things, this gets rid of this noise in the logs:

2012-07-10 20:51:42.617152 7facb23f1700 1 mon.a@1(peon).log v310 check_sub sub monmap not log type

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Mon, 13 Aug 2012 21:58:51 +0000 (14:58 -0700)]

v0.48.1argonaut

Yehuda Sadeh [Wed, 1 Aug 2012 20:22:38 +0000 (13:22 -0700)]

rgw: fix usage trim call encoding

Fixes: #2841.
Usage trim operation was encoding the wrong op structure (usage read).
Since the structures somewhat overlapped it somewhat worked, but user
info wasn't encoded.

Backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Yehuda Sadeh [Wed, 8 Aug 2012 22:21:53 +0000 (15:21 -0700)]

cls_rgw: fix rgw_cls_usage_log_trim_op encode/decode

It was not encoding the user; add that and reset version
compatibility.
This change affects the command interface and makes use of an old
radosgw-admin usage trim incompatible. Use of an old
radosgw-admin usage trim should be avoided, as it may
remove more data than requested. In any case, upgraded
server code will not handle old clients' trim requests.

backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Yehuda Sadeh [Tue, 31 Jul 2012 23:17:22 +0000 (16:17 -0700)]

rgw: expand date format support

Relaxing the date format parsing function to allow UTC
instead of GMT.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
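
A small Python sketch of the relaxed parsing, assuming the RFC 1123 style date layout. The function name is hypothetical and this is not the rgw parser.

```python
from datetime import datetime


def parse_http_date(value):
    """Parse an RFC 1123 style date whose zone is written as GMT or UTC."""
    body, _, zone = value.rpartition(" ")
    if zone not in ("GMT", "UTC"):
        raise ValueError("unsupported zone: %r" % zone)
    # strptime's %a and %b are locale-dependent; English names are assumed.
    return datetime.strptime(body, "%a, %d %b %Y %H:%M:%S")
```

Both "Tue, 31 Jul 2012 23:17:22 GMT" and the same string ending in "UTC" parse to the same timestamp under this sketch.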

Yehuda Sadeh [Thu, 2 Aug 2012 18:13:05 +0000 (11:13 -0700)]

rgw: complete multipart upload can handle chunked encoding

Fixes: #2878
We now allow complete multipart upload to use chunked encoding
when sending request data. With chunked encoding the HTTP_LENGTH
header is not required.

Backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Yehuda Sadeh [Wed, 1 Aug 2012 18:19:32 +0000 (11:19 -0700)]

rgw_xml: xml_handle_data() appends data string

Fixes: #2879.
xml_handle_data() appends data to the object instead of just
replacing it. Parsed data can arrive in pieces, specifically
when data is escaped.

Backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
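
The append-versus-replace point can be demonstrated with Python's expat binding, used here only as a stand-in for the rgw_xml code. A character-data handler may be called several times for one element, so the pieces have to be accumulated.

```python
import xml.parsers.expat


def element_text(document):
    """Collect character data, appending each piece the parser delivers."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = False                    # deliver data as it arrives
    parser.CharacterDataHandler = pieces.append   # append, do not replace
    parser.Parse(document, True)
    return "".join(pieces), len(pieces)
```

For a document such as `<ETag>&quot;abc&quot;</ETag>`, the joined text is `"abc"`. A handler that kept only the most recent piece could lose part of the value whenever the parser splits it around the escaped characters.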

Yehuda Sadeh [Wed, 1 Aug 2012 20:09:41 +0000 (13:09 -0700)]

rgw: ETag is unquoted in multipart upload complete

Fixes #2877.
Removing quotes from ETag before comparing it to what we
have when completing a multipart upload.

Backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
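
A minimal Python sketch of the comparison described above. The function names are hypothetical.

```python
def unquote_etag(etag):
    """Strip one pair of surrounding double quotes, if present."""
    if len(etag) >= 2 and etag[0] == '"' and etag[-1] == '"':
        return etag[1:-1]
    return etag


def etags_match(client_etag, stored_etag):
    """Compare a client-supplied ETag with the stored, unquoted value."""
    return unquote_etag(client_etag) == stored_etag
```

Clients may send the ETag with or without the quotes, so normalising before the comparison accepts both forms.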

Josh Durgin [Wed, 8 Aug 2012 22:24:57 +0000 (15:24 -0700)]

MonMap: return error on failure in build_initial

If mon_host fails to parse, return an error instead of success.
This avoids failing later on the monmap.size() > 0 assert
in MonClient.

Fixes: #2913
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Josh Durgin [Wed, 8 Aug 2012 22:10:27 +0000 (15:10 -0700)]

addr_parsing: report correct error message

getaddrinfo uses its return code to report failures.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Wed, 8 Aug 2012 21:01:53 +0000 (14:01 -0700)]

mkcephfs: use default osd_data, _journal values

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Sage Weil [Wed, 8 Aug 2012 21:01:35 +0000 (14:01 -0700)]

mkcephfs: use new default keyring locations

The ceph-conf command only parses the conf; it does not apply default
config values. This breaks mkcephfs if values are not specified in the
config.

Let ceph-osd create its own key, fix copying, and fix creation/copying for
the mds.

Fixes: #2845
Reported-by: Florian Haas <florian@hastexo.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Sage Weil [Tue, 31 Jul 2012 21:01:57 +0000 (14:01 -0700)]

osd: peering: detect when log source osd goes down

The Peering state has a generic check based on the prior set osds that
will restart peering if one of them goes down (or one of the interesting
down ones comes up). The GetLog state, however, can pull the log from
a peer that is not in the prior set if it got a notify from them (e.g., an
osd in an old interval that was down when the prior set was calculated).
If that osd goes down, we don't detect it and will block forever.

Fix by adding a simple check in GetLog for the newest_update_osd going
down.

(BTW GetMissing does not suffer from this problem because
peer_missing_requested is a subset of the prior set, so the Peering check
is sufficient.)

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>

Sylvain Munaut [Tue, 31 Jul 2012 18:55:56 +0000 (11:55 -0700)]

rbd: fix off-by-one error in key name

Fixes: #2846
Signed-off-by: Sylvain Munaut <tnt@246tNt.com>

Sylvain Munaut [Tue, 31 Jul 2012 18:54:29 +0000 (11:54 -0700)]

secret: return error on empty secret

Signed-off-by: Sylvain Munaut <tnt@246tNt.com>

Sage Weil [Sat, 28 Jul 2012 17:05:47 +0000 (10:05 -0700)]

osd: set STRAY on pg load when non-primary

The STRAY bit indicates that we should announce ourselves to the primary,
but it is only set in start_peering_interval(). We also need to set it
initially, so that a PG that is loaded but whose role does not change
(e.g., the stray replica stays a stray) will notify the primary.

Observed:
- osd starts up
- mapping does not change, STRAY not set
- does not announce to primary
- primary does not re-check must_have_unfound, objects appear unfound

Fix this by initializing STRAY when pg is loaded or created whenever we
are not the primary.

Fixes: #2866
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 27 Jul 2012 23:03:26 +0000 (16:03 -0700)]

osd: peering: make Incomplete a Peering substate

This allows us to still catch changes in the prior set that would affect
our conclusions (that we are incomplete) and, when they happen, restart
peering.

Consider:
- calc prior set, osd A is down
- query everyone else, no good info
- set down, go to Incomplete (previously WaitActingChange) state.
- osd A comes back up (we do nothing)
- osd A sends notify message with good info (we ignore)

By making this a Peering substate, we catch the Peering AdvMap reaction,
which will notice a prior set down osd is now up and move to Reset.

Fixes: #2860
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 27 Jul 2012 22:39:40 +0000 (15:39 -0700)]

osd: peering: move to Incomplete when.. incomplete

PG::choose_acting() may return false and *not* request an acting set change
if it can't find any suitable peers with enough info to recover. In that
case, we should move to Incomplete, not WaitActingChange, just like we do
a bit lower in GetLog() if we have non-contiguous logs. The state name is
more accurate, and this is also needed to fix bug #2860.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 27 Jul 2012 21:00:52 +0000 (14:00 -0700)]

Merge remote-tracking branch 'gh/stable' into stable-next

Sage Weil [Thu, 26 Jul 2012 23:35:00 +0000 (16:35 -0700)]

osd: fixing sharing of past_intervals on backfill restart

We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne(). However, the backfill code was updating the
peer info (history) in the block preceding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.

Fix by checking dne() prior to the backfill block. We still need to fill
in the message later because it isn't yet instantiated.

Fixes: #2849
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

Sage Weil [Thu, 26 Jul 2012 22:04:12 +0000 (15:04 -0700)]

Merge remote-tracking branch 'gh/wip-rbd-bid' into stable-next

Sage Weil [Mon, 23 Jul 2012 17:47:10 +0000 (10:47 -0700)]

mon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS

This ensures that when a new osd reclaims that id it behaves as if it were
really new.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 10 Jul 2012 03:54:19 +0000 (20:54 -0700)]

test_stress_watch: just one librados instance

This was creating a new cluster connection/session per iteration, and
along with it a few service threads and sockets and so forth.

Unfortunately, librados leaks like a sieve, starting with CephContext
and ceph::crypto::init(). See #845 and #2067.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 26 Jul 2012 22:03:50 +0000 (15:03 -0700)]

Merge commit '35b13266923f8095650f45562d66372e618c8824' into stable-next

First batch of msgr fixes.

Samuel Just [Mon, 9 Jul 2012 22:53:31 +0000 (15:53 -0700)]

ReplicatedPG: fix replay op ordering

After a client reconnect, the client replays outstanding ops. The
OSD then immediately responds with success if the op has already
committed (version < ReplicatedPG::get_first_in_progress).
Otherwise, we stick it in waiting_for_ondisk to be replied to when
eval_repop concludes that waitfor_disk is empty.

Fixes #2508

Signed-off-by: Samuel Just <sam.just@inktank.com>
Conflicts:

src/osd/ReplicatedPG.cc

Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]

objecter: always resend linger registrations

If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request.  The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch.  This in turn will break the watch (i.e., notifies won't
get delivered).

Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.

* track the tid of the registration op for each LingerOp
* mark registration ops as should_resend=false; cancel as needed
* when we send a new registration op, cancel the old one to ensure we
   ignore the reply.  This is needed because we resend linger ops on any
   pg change, not just a primary change.
* drop the first_send arg to send_linger(), as we can now infer that
   from register_tid == 0.

The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.

Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Mon, 9 Jul 2012 20:22:42 +0000 (13:22 -0700)]

osd: guard class call decoding

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 6 Jul 2012 01:08:58 +0000 (18:08 -0700)]

librados: take lock when signaling notify cond

When we are signaling the cond to indicate that a notify is complete,
take the appropriate lock. This removes the possibility of a race
that loses our signal. (That would be very difficult given that there
are network round trips involved, but this makes the lock/cond usage
"correct.")

Signed-off-by: Sage Weil <sage@inktank.com>
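
The lock/cond pattern can be sketched with Python's threading module. The class is hypothetical and not librados code. Note that Python's `Condition.notify_all()` raises `RuntimeError` when the lock is not held, so the discipline is enforced there rather than left to the caller.

```python
import threading


class NotifyWaiter:
    """Waits for a completion flag that another thread sets."""

    def __init__(self):
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.done = False

    def complete(self):
        # Take the lock before changing state and signaling, so a waiter
        # cannot test 'done' and then miss the signal.
        with self.lock:
            self.done = True
            self.cond.notify_all()

    def wait(self, timeout=None):
        with self.lock:
            return self.cond.wait_for(lambda: self.done, timeout)
```

Checking the flag under the same lock that guards the signal is what closes the lost-wakeup window, however narrow it is in practice.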

Sage Weil [Sun, 22 Jul 2012 14:46:11 +0000 (07:46 -0700)]

workqueue: kick -> wake or _wake, depending on locking

Break kick() into wake() and _wake() methods, depending on whether the
lock is already held. (The rename ensures that we audit/fix all
callers.)

Signed-off-by: Sage Weil <sage@inktank.com>
Conflicts:

src/common/WorkQueue.h
src/osd/OSD.cc

Sage Weil [Wed, 4 Jul 2012 22:11:21 +0000 (15:11 -0700)]

client: fix locking for SafeCond users

Need to wait on flock, not client_lock.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 26 Jul 2012 22:01:05 +0000 (15:01 -0700)]

filestore: check for EIO in read path

Check for EIO in read methods and helpers. Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.

The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 26 Jul 2012 16:07:46 +0000 (09:07 -0700)]

filestore: add 'filestore fail eio' option, default true

By default we will assert/fail/crash on EIO from the underlying fs. We
already do this in the write path, but not the read path, or in various
internal infrastructure.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 24 Jul 2012 20:53:03 +0000 (13:53 -0700)]

config: fix 'config set' admin socket command

Fixes: #2832
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 25 Jul 2012 23:35:09 +0000 (16:35 -0700)]

osd: break potentially large transaction into pieces

We do a similar trick elsewhere. Control this via a tunable. Eventually
we'll control the others (in a non-stable branch).

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 25 Jul 2012 21:53:34 +0000 (14:53 -0700)]

osd: only commit past intervals at end of parallel build

We don't check for gaps in the past intervals, so we should only commit
this when we are completely done. Otherwise a partial run and restart will
leave the gap in place, which may confuse the peering code that relies on
this information.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 25 Jul 2012 17:57:35 +0000 (10:57 -0700)]

osd: generate past intervals in parallel on boot

Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while. This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.

On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range. The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
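
The "union of ranges, one pass over the maps" idea can be sketched in Python. The data shapes and the `visit` callback are hypothetical. The real code keeps per-pg state in the pistate struct mentioned above.

```python
def build_in_parallel(pg_ranges, visit):
    """Walk the union of all pg epoch ranges once.

    'pg_ranges' maps pg -> (first_epoch, last_epoch), inclusive, and
    'visit(pg, epoch)' is called for every epoch a pg needs.
    """
    if not pg_ranges:
        return
    start = min(first for first, _ in pg_ranges.values())
    end = max(last for _, last in pg_ranges.values())
    for epoch in range(start, end + 1):
        # Each map epoch would be loaded once here and shared by all pgs.
        for pg, (first, last) in pg_ranges.items():
            if first <= epoch <= last:
                visit(pg, epoch)
```

The benefit over a per-pg loop is that each epoch in the union is touched once, however many pgs need it.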

Sage Weil [Wed, 25 Jul 2012 17:58:07 +0000 (10:58 -0700)]

osd: move calculation of past_interval range into helper

PG::generate_past_intervals() first calculates the range over which it
needs to generate past intervals. Do this in a helper function.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Wed, 25 Jul 2012 17:58:28 +0000 (10:58 -0700)]

osd: fix map epoch boot condition

We only want to join the cluster if we can catch up to the latest
osdmap with a small number of maps, in this case a single map message.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

Sage Weil [Wed, 25 Jul 2012 03:18:01 +0000 (20:18 -0700)]

mon: ignore pgtemp messages from down osds

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 25 Jul 2012 03:16:04 +0000 (20:16 -0700)]

mon: ignore osd_alive messages from down osds

Signed-off-by: Sage Weil <sage@inktank.com>

Josh Durgin [Mon, 23 Jul 2012 21:05:53 +0000 (14:05 -0700)]

librbd: replace assign_bid with client id and random number

The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.

This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.

Keep the server side assign_bid around in case there are old clients
still using it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
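
A Python sketch of a client-chosen id. The hex formatting and the 32-bit width are assumptions made for illustration. The message only says the bid is based on the client's global id and a random number.

```python
import random


def make_block_name_id(global_id, rng=random):
    """Combine the client's global id with a random number, both in hex."""
    return "%x%x" % (global_id, rng.getrandbits(32))
```

Because the client computes the id itself, nothing has to be read back from a write operation, which sidesteps the replay problem described above.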

Dan Mick [Mon, 9 Jul 2012 21:11:23 +0000 (14:11 -0700)]

librados: add new constructor to form a Rados object from IoCtx

This creates a separate reference to an existing connection, for
use when a client holding IoCtx needs to consult another (say,
for rbd cloning)

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Thu, 19 Jul 2012 02:49:58 +0000 (19:49 -0700)]

add CRUSH_TUNABLES feature bit

Signed-off-by: Sage Weil <sage@inktank.com>

Josh Durgin [Wed, 18 Jul 2012 17:24:58 +0000 (10:24 -0700)]

ObjectCacher: fix cache_bytes_hit accounting

Misses are not hits!

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Pascal de Bruijn | Unilogic Networks B.V [Wed, 11 Jul 2012 13:23:16 +0000 (15:23 +0200)]

Robustify ceph-rbdnamer and adapt udev rules

Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

  /dev/rbd/mypool/myrbd -> /dev/rbd1

That link should have pointed to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

The old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it yields 3 for rbd3 but 1 for rbd3p1, so it can't be
depended upon for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust and it
can now be called in various ways:

  /usr/bin/ceph-rbdnamer /dev/rbd3
  /usr/bin/ceph-rbdnamer /dev/rbd3p1
  /usr/bin/ceph-rbdnamer rbd3
  /usr/bin/ceph-rbdnamer rbd3p1
  /usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
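
The normalization described above can be sketched as follows. This is Python, not the shell used by the real ceph-rbdnamer script, and the function name `rbd_device_number` and the regular expression are assumptions made for illustration.

```python
import re

# Optional "/dev/" prefix, optional "rbd" prefix, the device number,
# and an optional partition suffix such as "p1".
_RBD_NAME = re.compile(r"^(?:/dev/)?(?:rbd)?(\d+)(?:p\d+)?$")


def rbd_device_number(arg: str) -> int:
    """Return the rbd device number for any of the accepted spellings."""
    match = _RBD_NAME.match(arg)
    if match is None:
        raise ValueError(f"not an rbd device name: {arg!r}")
    return int(match.group(1))


for arg in ("/dev/rbd3", "/dev/rbd3p1", "rbd3", "rbd3p1", "3"):
    print(arg, "->", rbd_device_number(arg))
```

All five spellings resolve to device 3, which is why the udev rule can pass the kernel name (%k) for both disks and partitions.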

With that fixed, we hit the second problem. We ended up with:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. What probably went wrong is that udev discovered the disk
and ran ceph-rbdnamer, which resolved it to myrbd, so the following
symlink was created:

  /dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

  /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
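
A small sketch of the naming scheme, assuming the kernel name has the form rbdN or rbdNpM. The helper `symlink_for` is hypothetical and written in Python; the real logic lives in the udev rules.

```python
import re

# Kernel names look like "rbd3" (disk) or "rbd3p1" (partition).
_KERNEL_NAME = re.compile(r"^rbd\d+(?:p(\d+))?$")


def symlink_for(pool: str, image: str, kernel_name: str) -> str:
    """Return the symlink path for a disk or one of its partitions."""
    match = _KERNEL_NAME.match(kernel_name)
    if match is None:
        raise ValueError(f"unexpected kernel name: {kernel_name!r}")
    base = f"/dev/rbd/{pool}/{image}"
    partition = match.group(1)
    # Disks keep the plain name; partitions get a "-partN" suffix so
    # they can no longer overwrite the disk's symlink.
    return base if partition is None else f"{base}-part{partition}"


print(symlink_for("mypool", "myrbd", "rbd3"))
print(symlink_for("mypool", "myrbd", "rbd3p1"))
```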

Please let me know any feedback you have on this patch or the approach
used.

Regards,
Pascal de Bruijn
Unilogic B.V.

Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]

log: apply log_level to stderr/syslog logic

In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
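
A minimal sketch of the condition, in Python instead of the C++ log code. It follows the convention that lower numbers mean more important messages, so "below the threshold" is written as `prio <= level`. The function and parameter names are hypothetical, and the crash branch is an assumption: the message above only describes the non-crash case.

```python
def emit_to_channel(prio: int, channel_level: int, log_level: int,
                    crash: bool = False) -> bool:
    """Decide whether an entry is sent to stderr or syslog."""
    if crash:
        # Assumption: during a crash dump only the channel threshold applies.
        return prio <= channel_level
    # Non-crash: the entry must pass both the channel threshold and the
    # normal log threshold.
    return prio <= channel_level and prio <= log_level


# A gathered debug entry (prio 10) with log level 5 stays off stderr,
# even though the stderr threshold alone (20) would allow it.
print(emit_to_channel(10, channel_level=20, log_level=5))
```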

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Mon, 16 Jul 2012 22:40:53 +0000 (15:40 -0700)]

log: fix event gather condition

We should gather an event if it is below the log or gather threshold.

Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
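
The difference between the old and new conditions can be sketched like this, again with lower numbers meaning more important. The function names are hypothetical and the code is Python rather than the original C++.

```python
def should_gather_old(prio: int, log_level: int) -> bool:
    """Old behaviour: gather only what would be printed anyway."""
    return prio <= log_level


def should_gather(prio: int, log_level: int, gather_level: int) -> bool:
    """New behaviour: gather if either threshold admits the event."""
    return prio <= log_level or prio <= gather_level


# With log level 5 and gather level 20, a prio-10 event is now kept in
# memory for a later dump even though it is not written to the log.
print(should_gather_old(10, log_level=5))
print(should_gather(10, log_level=5, gather_level=20))
```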

Signed-off-by: Sage Weil <sage@inktank.com>

Samuel Just [Mon, 16 Jul 2012 20:11:24 +0000 (13:11 -0700)]

PG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub

We need to reset the last_pg_scrub data in the osd since we
are replacing the info.

Probably fixes #2453

In cases like 2453, we hit the following backtrace:

0> 2012-05-19 17:24:09.113684 7fe66be3d700 -1 osd/OSD.h: In function 'void OSD::unreg_last_pg_scrub(pg_t, utime_t)' thread 7fe66be3d700 time 2012-05-19 17:24:09.095719
osd/OSD.h: 840: FAILED assert(last_scrub_pg.count(p))

ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b)
1: (OSD::unreg_last_pg_scrub(pg_t, utime_t)+0x149) [0x641f49]
2: (PG::proc_primary_info(ObjectStore::Transaction&, pg_info_t const&)+0x5e) [0x63383e]
3: (PG::RecoveryState::ReplicaActive::react(PG::RecoveryState::MInfoRec const&)+0x4a) [0x633eda]
4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::MQuery>, boost::statechart::custom_reaction<PG::RecoveryState::MInfoRec>, boost::statechart::custom_reaction<PG::RecoveryState::MLogRec> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x130) [0x6466a0]
5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x646791]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x63dfcb]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x63e0f1]
8: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x177) [0x616987]
9: (OSD::handle_pg_info(std::tr1::shared_ptr<OpRequest>)+0x665) [0x5d3d15]
10: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x2a0) [0x5d7370]
11: (OSD::_dispatch(Message*)+0x191) [0x5dd4a1]
12: (OSD::ms_dispatch(Message*)+0x153) [0x5ddda3]
13: (SimpleMessenger::dispatch_entry()+0x863) [0x77fbc3]
14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x746c5d]
15: (()+0x7efc) [0x7fe679b1fefc]
16: (clone()+0x6d) [0x7fe67815089d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state resulting in the
above assert failure.
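
The failure mode can be illustrated with a small Python model. `ScrubRegistry` and `replace_info` are hypothetical stand-ins for the OSD-side bookkeeping, not the real classes; the sketch only shows why the registered stamp and the stamp stored in the PG info have to change together.

```python
class ScrubRegistry:
    """Tracks (stamp, pgid) pairs, as the OSD does for scrub scheduling."""

    def __init__(self):
        self._entries = set()

    def reg(self, pgid, stamp):
        self._entries.add((stamp, pgid))

    def unreg(self, pgid, stamp):
        # Mirrors the failed assert in the backtrace above.
        assert (stamp, pgid) in self._entries, "entry was never registered"
        self._entries.remove((stamp, pgid))


def replace_info(registry, pg, new_stamp):
    """Replace the stamp while keeping the registry in sync."""
    registry.unreg(pg["pgid"], pg["last_scrub_stamp"])
    pg["last_scrub_stamp"] = new_stamp
    registry.reg(pg["pgid"], new_stamp)


registry = ScrubRegistry()
pg = {"pgid": "1.0", "last_scrub_stamp": 100}
registry.reg(pg["pgid"], pg["last_scrub_stamp"])
replace_info(registry, pg, 200)
registry.unreg(pg["pgid"], pg["last_scrub_stamp"])  # succeeds: stamps agree
```

Assigning `pg["last_scrub_stamp"]` directly, without the unreg/reg pair, would make the final `unreg` call trip the assertion.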

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>

Samuel Just [Tue, 10 Jul 2012 00:57:03 +0000 (17:57 -0700)]

ReplicatedPG: don't warn if backfill peer stats don't match

pinfo.stats might be wrong if we did log-based recovery on the
backfilled portion in addition to continuing backfill.

bug #2750

Signed-off-by: Samuel Just <sam.just@inktank.com>

Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]

mon/MonitorStore: always O_TRUNC when writing states

It is possible for a .new file to already exist, potentially with a
larger size. This would happen if:

- we were proposing a different value
- we crashed (or were stopped) before it got renamed into place
- after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages. O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
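
The effect is easy to reproduce outside the monitor. The sketch below uses Python's `os.open` to write a value through a ".new" file and a rename, with and without O_TRUNC. The helper `put_value` and the file names are made up for the demonstration and are not MonitorStore code.

```python
import os
import tempfile


def put_value(path: str, data: bytes, truncate: bool = True) -> None:
    """Write data to path via a temporary ".new" file and a rename."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC  # discard leftovers from an interrupted write
    fd = os.open(path + ".new", flags, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(path + ".new", path)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "logm_1")
    for truncate in (False, True):
        # Simulate a crash that left a larger ".new" file behind.
        with open(path + ".new", "wb") as f:
            f.write(b"OLD-PROPOSAL")
        put_value(path, b"new", truncate=truncate)
        with open(path, "rb") as f:
            print(truncate, f.read())
```

Without O_TRUNC the stored value is `b"new-PROPOSAL"`: the new bytes followed by the tail of the stale file. With O_TRUNC it is `b"new"`.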

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]

osd: base misdirected op role calc on acting set

We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>

Josh Durgin [Fri, 13 Jul 2012 16:42:20 +0000 (09:42 -0700)]

qa: download tests from specified branch

These Python tests aren't installed, so they need to be downloaded.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Yehuda Sadeh [Mon, 25 Jun 2012 16:47:37 +0000 (09:47 -0700)]

rgw: don't override subuser perm mask if perm not specified

Bug #2650. We were overriding subuser perm mask whenever subuser
was modified, even if perm mask was not passed.
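
A sketch of the intended behaviour, using a plain dict in Python in place of the rgw subuser structure. The function `modify_subuser` and its fields are hypothetical.

```python
def modify_subuser(subuser: dict, *, key=None, perm_mask=None) -> dict:
    """Return an updated copy, touching only the fields that were passed."""
    updated = dict(subuser)
    if key is not None:
        updated["key"] = key
    if perm_mask is not None:
        # Only override the mask when the caller actually specified one.
        updated["perm_mask"] = perm_mask
    return updated


subuser = {"name": "alice:swift", "key": "old", "perm_mask": 0x3}
print(modify_subuser(subuser, key="new"))       # perm_mask is preserved
print(modify_subuser(subuser, perm_mask=0x1))   # perm_mask is replaced
```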

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

James Page [Wed, 11 Jul 2012 18:34:21 +0000 (11:34 -0700)]

debian: fix ceph-fs-common-dbg depends

Signed-off-by: James Page <james.page@ubuntu.com>

Yehuda Sadeh [Wed, 11 Jul 2012 18:52:24 +0000 (11:52 -0700)]

rados tool: remove -t param option for target pool

Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.
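
The resulting option layout can be sketched with Python's argparse. The real rados tool is C++ with its own argument parsing; the parser below is only an illustration, and treating `-t` as the bench concurrency option with a default of 16 is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rados-sketch")
    # -t stays reserved for the bench option (assumed meaning and default).
    parser.add_argument("-t", "--concurrent-ios", type=int, default=16)
    # The target pool has a long option only, so it cannot clash with -t.
    parser.add_argument("--target-pool")
    return parser


args = build_parser().parse_args(["-t", "4", "--target-pool", "backup"])
print(args.concurrent_ios, args.target_pool)
```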

Backport: argonaut

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Sage Weil [Wed, 11 Jul 2012 16:19:00 +0000 (09:19 -0700)]

Makefile: don't install crush headers

This is leftover from when we built a libcrush.so. We can re-add when we
start doing that again.

Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 10 Jul 2012 20:18:27 +0000 (13:18 -0700)]

msgr: take over existing Connection on Pipe replacement

If a new pipe/socket is taking over an existing session, it should also
take over the Connection* associated with the existing session. Because
we cannot clear existing->connection_state, we just take another reference.

Clean up the comments a bit while we're here.

This affects MDS<->client sessions when reconnecting after a socket fault.
It probably also affects intra-cluster (osd/osd, mds/mds, mon/mon)
sessions as well, but I did not confirm that.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Mon, 9 Jul 2012 03:33:12 +0000 (20:33 -0700)]

debian: include librados-config in librados-dev

Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 3 Jul 2012 20:04:28 +0000 (13:04 -0700)]

lockdep: increase max locks

Hit this limit with the rados api tests.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 3 Jul 2012 19:07:28 +0000 (12:07 -0700)]

config: add unlocked version of get_my_sections; use it internally

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 3 Jul 2012 15:20:06 +0000 (08:20 -0700)]

config: fix lock recursion in get_val_from_conf_file()

Introduce a private, already-locked version.
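
The pattern can be sketched in Python with a non-recursive `threading.Lock`. The class and method names are hypothetical and only mirror the idea of a public locking entry point plus a private helper that expects the lock to be held.

```python
import threading


class Config:
    def __init__(self, values):
        self._lock = threading.Lock()  # non-recursive, like a plain mutex
        self._values = dict(values)

    def get_val(self, key):
        """Public entry point: takes the lock."""
        with self._lock:
            return self._get_val_locked(key)

    def _get_val_locked(self, key):
        """Private helper: the caller must already hold self._lock."""
        return self._values.get(key)

    def describe(self, key):
        with self._lock:
            # Calling self.get_val() here would try to take the lock a
            # second time and deadlock; use the already-locked helper.
            return f"{key}={self._get_val_locked(key)}"


conf = Config({"log_file": "/var/log/ceph/osd.log"})
print(conf.get_val("log_file"))
print(conf.describe("log_file"))
```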

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Tue, 3 Jul 2012 15:15:08 +0000 (08:15 -0700)]

config: fix recursive lock in parse_config_files()

The _impl() helper is only called from parse_config_files(); don't retake
the lock.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 4 Jul 2012 01:51:02 +0000 (18:51 -0700)]

rgw: initialize fields of RGWObjEnt

This fixes various valgrind warnings triggered by the s3test
test_object_create_unreadable.

Signed-off-by: Sage Weil <sage@inktank.com>

Yehuda Sadeh [Fri, 6 Jul 2012 20:14:53 +0000 (13:14 -0700)]

rgw: handle response-* params

Handle response-* params that set response header field values.
Fixes #2734, #2735.
Backport: argonaut
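
A sketch of how such parameters map onto response headers, in Python. The parameter list is the set of response-* overrides defined by the S3 GET Object API; whether rgw handles exactly this set is not stated above, and `apply_response_params` is a hypothetical helper.

```python
# Query parameter -> response header it overrides (per the S3 API).
RESPONSE_OVERRIDES = {
    "response-content-type": "Content-Type",
    "response-content-language": "Content-Language",
    "response-expires": "Expires",
    "response-cache-control": "Cache-Control",
    "response-content-disposition": "Content-Disposition",
    "response-content-encoding": "Content-Encoding",
}


def apply_response_params(headers: dict, query: dict) -> dict:
    """Return the headers with any response-* overrides applied."""
    result = dict(headers)
    for param, header in RESPONSE_OVERRIDES.items():
        if param in query:
            result[header] = query[param]
    return result


stored = {"Content-Type": "binary/octet-stream"}
query = {"response-content-type": "text/plain",
         "response-content-disposition": "attachment; filename=a.txt"}
print(apply_response_params(stored, query))
```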

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>

Sage Weil [Wed, 4 Jul 2012 20:59:04 +0000 (13:59 -0700)]

osd: add missing formatter close_section() to scrub status

Also add braces to make the open/close matchups easier to see. Broken
by f36617392710f9b3538bfd59d45fd72265993d57.

Signed-off-by: Sage Weil <sage@inktank.com>
