pgmon: clear out osd reports after the OSD has gone down
Previously we never removed report times from last_osd_report. Do
so in check_osd_map() (which, on the leader, is called synchronously
with adopting a new OSD map).
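A minimal sketch of the pruning idea, written in Python rather than the monitor's C++. Only the name last_osd_report comes from the message above; prune_down_osd_reports and the is_up predicate are hypothetical stand-ins, not Ceph APIs.

```python
def prune_down_osd_reports(last_osd_report, is_up):
    """Drop report timestamps for OSDs that are no longer up.

    last_osd_report: dict mapping osd id -> time of last report
    is_up: hypothetical predicate standing in for an OSD map lookup
    Returns the list of osd ids whose entries were removed.
    """
    removed = [osd for osd in last_osd_report if not is_up(osd)]
    for osd in removed:
        del last_osd_report[osd]
    return removed
```

Calling this each time a new map is adopted keeps the table from holding entries for OSDs that have gone down.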
Sage Weil [Wed, 11 Apr 2012 22:32:06 +0000 (15:32 -0700)]
mon: command to disable localized pgs for a pool
ceph osd pool disable_lpgs <poolname> --yes-i-really-mean-it
Grr, these should be off by default. We can't adjust them down. And
currently any pool adjustment triggers pg creation, which will create these
guys up through max_osd (but not, mind you, when max_osd changes). And
a bug in the OSDs makes them think that creation is a split and get
confused.
Sage Weil [Fri, 6 Apr 2012 16:33:43 +0000 (09:33 -0700)]
encoding: use iterator to copy_in encoded length
This gives us a pointer to the position into the list where the final
length value will be copied. Previously we used bl.copy_in(), which takes
a byte offset and needs to iterate over the bufferlist to seek to the
correct position, resulting in O(n^2) encoding time for large structures.
Fixes: #2161
Reported-by: Jim Schutt <jaschut@sandia.gov>
Diagnosed-by: Ake van der Meer <petrabbit@xs4all.nl>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
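The cost difference described in this commit can be illustrated with a toy segmented buffer. This is a sketch in Python, not the bufferlist code: SegmentedBuffer, reserve_u32, patch_u32 and encode_with_length are hypothetical names, and the saved segment index plays the role of the iterator.

```python
import struct


class SegmentedBuffer:
    """Toy stand-in for a bufferlist: a list of separate byte segments."""

    def __init__(self):
        self.segments = []
        self.length = 0

    def append(self, data):
        self.segments.append(bytearray(data))
        self.length += len(data)

    def reserve_u32(self):
        """Append a 4-byte placeholder and return a handle to it."""
        self.append(b"\x00\x00\x00\x00")
        return len(self.segments) - 1

    def patch_u32(self, handle, value):
        """Write through the saved handle: no seeking, constant time."""
        self.segments[handle][:] = struct.pack("<I", value)

    def copy_in(self, offset, data):
        """Write at a byte offset: walks segments from the front each call.

        Assumes the write fits inside a single segment.
        """
        for seg in self.segments:
            if offset < len(seg):
                seg[offset:offset + len(data)] = data
                return
            offset -= len(seg)
        raise IndexError("offset past end of buffer")

    def to_bytes(self):
        return b"".join(bytes(s) for s in self.segments)


def encode_with_length(buf, payload_chunks):
    """Encode a length-prefixed payload, patching the length afterwards."""
    handle = buf.reserve_u32()
    start = buf.length
    for chunk in payload_chunks:
        buf.append(chunk)
    buf.patch_u32(handle, buf.length - start)
```

With copy_in(), every nested structure pays a walk proportional to the number of segments already written, which is where the quadratic behaviour comes from; the saved handle avoids that walk.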
Sage Weil [Tue, 3 Apr 2012 21:21:53 +0000 (14:21 -0700)]
rgw: throttle at num_threads * 2
If we throttle at num_threads, then nothing gets into the workqueue until
a worker thread is idle, which means you pay the latency of setting it up
and queueing it. This way we keep some requests ready to go.
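A sketch of the throttle sizing, in Python for illustration only. Throttle and make_request_throttle are hypothetical names; only the num_threads * 2 limit is taken from the message.

```python
import threading


class Throttle:
    """Counting throttle: at most `limit` requests admitted at once."""

    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)

    def try_get(self):
        # Non-blocking acquire: True if the request was admitted.
        return self._sem.acquire(blocking=False)

    def put(self):
        # Called when a request completes, freeing one slot.
        self._sem.release()


def make_request_throttle(num_threads):
    # Admit twice as many requests as there are workers, so that up to
    # num_threads requests are already queued when a worker goes idle.
    return Throttle(num_threads * 2)
```

With two worker threads, four requests are admitted and the fifth is held back until one completes.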
Greg Farnum [Wed, 28 Mar 2012 22:06:32 +0000 (15:06 -0700)]
msgr: clean up Pipe::do_sendmsg.
Document it as with the tcp stuff, remove an if(0)'d debugging block,
and remove the useless "sd" parameter since it's always the same as
the Pipe's sd member.
Greg Farnum [Tue, 27 Mar 2012 19:57:14 +0000 (12:57 -0700)]
msgr: make a bunch of stuff private.
Why were all these data members public? They're accessed by Pipes
and the Accepter and stuff, so maybe that's why...but that's all
internal interface stuff.
Convert ms_addr and _my_name to be references to their fields in
the entity_inst_t my_inst.
This way we can use const references for accessing all of them,
instead of the bizarre distinction we had before for get_myinst().
Greg Farnum [Mon, 19 Mar 2012 20:12:14 +0000 (13:12 -0700)]
msgr: change the signature of get_myaddr()
Return a const reference to the actual address, instead of copying it.
All current users are happy with this, and I can't see a good reason
to copy it instead.
Greg Farnum [Thu, 8 Mar 2012 00:43:04 +0000 (16:43 -0800)]
msgr: get_connection() is required to establish a connection if none exists.
Making an allowance for lossy server connections is silly. Just don't
ask for the Connection in that case. (There aren't any users who
rely on the previous behavior.)
Document that requirement in Messenger.h!
Greg Farnum [Sat, 31 Mar 2012 00:07:19 +0000 (17:07 -0700)]
ceph_mon: fix fsid parsing.
fsid is a field in the CephContext _conf structure and is parsed by
the standard options parsing library before it gets to the ceph_mon
custom parsing.
Instead do the standard parsing, and check that member directly
to decide if we want to (over)write the monmap's fsid.
Sage Weil [Fri, 30 Mar 2012 23:14:05 +0000 (16:14 -0700)]
osd: update_stats() on reads too
Update pg stats on any op completion (read or write), not just writes. Do
the calls with log_op_stats() for consistency's sake. Skip if the request
was an error.
Fixes: #2209
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Tommi Virtanen [Wed, 28 Mar 2012 20:55:01 +0000 (13:55 -0700)]
doc: Convert the mailing list mention to not be a section heading.
If toctree is inside a section, the subtree is inside the section too.
We don't want all of dev/* to be under "Mailing list".
I have not found a decent workaround for this. The toplevel toctree
avoids this purely by the fact that it is the topmost toctree. Right
now that means you should 1) avoid having more than a few paragraphs of
text before the toctree for that subtree (put most of the content after
the toctree; clumsy if the toctree is long), or 2) put the toctree
immediately after the document title, make it :hidden:, and let the
reader use links in the text or the ToC in the sidebar to navigate.
See start/index for an example of this.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Sage Weil [Fri, 30 Mar 2012 16:51:45 +0000 (09:51 -0700)]
filestore: set guard on collection_move
During recovery we submit transactions like:
- delete a/foo
- move tmp/foo to a/foo
This prevents the EEXIST check in collection_move from doing any good,
since the destination never exists. We need to do that remove at least
sometimes, because we may be overwriting an existing/older version of the
object.
So,
- set the guard after we do the move, so that
  - the delete won't be repeated, and
  - the EEXIST check will work
Also check the guard for good measure (although that doesn't do anything
specifically useful in this scenario).
Fixes: #2164
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
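The effect of setting the guard after the move can be sketched with a toy store. This is a Python illustration, not FileStore's implementation: ToyStore, its methods, and the single integer sequence number used as the guard are all simplifying assumptions.

```python
class ToyStore:
    """Toy object store with a per-object replay guard."""

    def __init__(self):
        self.objects = {}  # (collection, name) -> data
        self.guards = {}   # (collection, name) -> seq of last guarded op

    def _guarded(self, key, seq):
        # An op is skipped if the guard shows it was already applied.
        return self.guards.get(key, -1) >= seq

    def remove(self, coll, name, seq):
        key = (coll, name)
        if self._guarded(key, seq):
            return "skipped"
        self.objects.pop(key, None)
        return "done"

    def collection_move(self, src, dst, name, seq):
        dst_key = (dst, name)
        if self._guarded(dst_key, seq):
            return "skipped"
        if dst_key in self.objects:
            return "eexist"
        self.objects[dst_key] = self.objects.pop((src, name))
        # Set the guard only after the move has happened.
        self.guards[dst_key] = seq
        return "done"
```

On first application both operations run and the guard is recorded on a/foo. If the same transaction is replayed, the delete is skipped, so the freshly moved object is not removed.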
Sage Weil [Thu, 29 Mar 2012 05:32:30 +0000 (22:32 -0700)]
osd: discard heartbeat_peer in note_down_osd
Discard the heartbeat_peer as soon as we find out, along with queued
failures, or else the heartbeat_check may come along (without map_lock)
and requeue a failure. And then later, when we try to report it, we'll
osdmap->get_inst() on a now-down OSD and fail miserably.
Reported-by: Wido den Hollander <wido@widodh.nl>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
test: test_workload_gen: Add callback for collection destruction.
When we remove a collection, we must clean up after the coll_entry_t we
once had on the available collections set. For some reason, we weren't
doing this.
This commit adds a new callback, which inherits from the 'OnReadable'
callback on the WorkloadGenerator class, that will be responsible for
deleting the coll_entry_t once we know the collection transaction
destroying the collection has finished.
test: test_workload_gen: Change CLI option and add '--help' usage.
With this commit, we support the following options (and old ones are no
longer available):
  --test-num-colls VAL                 Set the number of collections
  --test-num-objs-per-coll VAL         Set the number of objects per
                                       collection
  --test-destroy-coll-per-N-trans VAL  Set how many transactions to run
                                       before destroying a collection.
And --help will show the program's usage description.
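For illustration, the option set above can be mocked up with Python's argparse. The real tool is a C++ program; build_parser is a hypothetical helper, and defaults are deliberately left unset because this message does not state them.

```python
import argparse


def build_parser():
    """Mock of the test_workload_gen options listed in the commit message."""
    p = argparse.ArgumentParser(prog="test_workload_gen")
    p.add_argument("--test-num-colls", metavar="VAL", type=int,
                   help="Set the number of collections")
    p.add_argument("--test-num-objs-per-coll", metavar="VAL", type=int,
                   help="Set the number of objects per collection")
    p.add_argument("--test-destroy-coll-per-N-trans", metavar="VAL", type=int,
                   help="Set how many transactions to run before "
                        "destroying a collection")
    return p
```

argparse supplies --help automatically, which mirrors the usage output the commit adds.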
test: test_workload_gen: Default arguments, and minor changes.
Besides adding support for default arguments, passed onto global_init(),
this commit fixes a conflict in Makefile.am, and a missing lib
dependency. Also, we previously ignored the return values from
store->mkfs() and store->mount(); we now check them.
test: test_workload_gen: CodeStyle compliance and cleanup.
This commit aims at compliance with Ceph's CodeStyle, as well
as cleaning up some lingering unused code.
Also, now we allow changing the default OSD data and journal
locations, as well as the OSD journal size, by providing the
options '--osd-data <PATH>', '--osd-journal <PATH>' and
'--osd-journal-size <VAL>' on the CLI arguments. If not provided,
these will default to 'workload_gen_dir', 'workload_gen_journal'
and '400', respectively.
In its current state, the workload generator will queue a lot of
transactions onto the FileStore, and will wait if needed in case
there are too many in-flight transactions.
The workload generator will perform the transactions over a
pre-determined number of collections and objects, which may very
well be defined at runtime by using the options '-C <VAL>' and
'-O <VAL>' for collections and objects per collection, respectively.
If these are not provided, the program will default to 30 collections
and 6000 objects per collection.
Sage Weil [Tue, 27 Mar 2012 22:12:07 +0000 (15:12 -0700)]
osd: fix handling of recovery sources when osds go down
If a source osd goes down, we need to
- reset any pulls (already did that before)
- remove peer from missing_loc so that we know what is now unfound
- restart recovery/discover_all_missing in case new stuff is now unfound
This fixes a bug like so:
- we peer
- we find an object we need to recover on a stray osd
- that osd goes down
- recover_primary() thinks unfound=0 but it really is 1
... recover_primary 3270c60b/mds0_sessionmap/head 4'1 (missing) (missing head)
... pull 3270c60b/mds0_sessionmap/head v 4'1 but it is unfound
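The bookkeeping this fix depends on can be sketched as follows. This is a Python illustration under simplifying assumptions: note_source_down is a hypothetical helper, missing is modelled as a set of object ids, and missing_loc as a dict from object id to the set of OSDs that hold a copy.

```python
def note_source_down(missing, missing_loc, down_osd):
    """Remove down_osd from missing_loc and return the unfound count.

    An object is unfound when it is missing and no remaining peer is
    known to hold a copy of it.
    """
    for oid in list(missing_loc):
        missing_loc[oid].discard(down_osd)
        if not missing_loc[oid]:
            del missing_loc[oid]
    return sum(1 for oid in missing if oid not in missing_loc)
```

In the scenario from the message, the only known source for the object goes down, so the unfound count rises from 0 to 1 and recovery has to be restarted with that in mind.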
Sage Weil [Tue, 27 Mar 2012 22:02:51 +0000 (15:02 -0700)]
osd: maintain missing_loc_sources
This is a superset of all missing_loc values... everywhere we might
pull an object from, or are currently pulling from. Initially it's the
union, but as missing_loc shrinks it may contain peers that are no longer
in missing_loc.
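The superset relationship can be shown with a small sketch. It is written in Python for illustration; MissingLocations and its methods are hypothetical, and only the names missing_loc and missing_loc_sources come from the message.

```python
class MissingLocations:
    """Tracks where missing objects can be pulled from."""

    def __init__(self):
        self.missing_loc = {}             # object id -> set of peer ids
        self.missing_loc_sources = set()  # every peer recorded as a source

    def add_location(self, oid, peer):
        self.missing_loc.setdefault(oid, set()).add(peer)
        self.missing_loc_sources.add(peer)

    def recovered(self, oid):
        # missing_loc shrinks, but the source set is left alone.
        self.missing_loc.pop(oid, None)

    def current_peers(self):
        # Union of all peers still referenced by missing_loc.
        return set().union(*self.missing_loc.values())
```

Right after the locations are added the two sets are equal; once an object is recovered, current_peers() can become a strict subset of missing_loc_sources.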
Yehuda Sadeh [Tue, 27 Mar 2012 21:12:55 +0000 (14:12 -0700)]
rgw: remove pool_list(), can't list_objects() on system buckets
pool_list() was broken and is now replaced with pool_iterate().
list_objects() shouldn't be used any more with system buckets (raw
pools): we can't have it return a sorted list of objects without reading
the entire list.