Sage Weil [Sun, 20 May 2012 22:32:19 +0000 (15:32 -0700)]
keys: new release key
New release key for signing packages. Signed by me (the old release key)
so that existing apt keyrings should be sufficient. New keyrings should
just add the new release key.
Josh Durgin [Wed, 16 May 2012 19:41:27 +0000 (12:41 -0700)]
librbd: check for cache flush errors
Return errors from flushing to the caller. Warn
if an error occurs during invalidation, but don't retry,
since the higher level handles these cases, namely:
* rollback (doing this with an image open is asking for trouble)
* shrink (doing this with writes in flight may create extra objects anyway)
* shutdown (qemu flushes before closing the device)
Josh Durgin [Tue, 15 May 2012 22:21:50 +0000 (15:21 -0700)]
ObjectCacher: handle write errors
If a write error occurs, mark the BufferHead dirty again, and
pass the return value to the completion. This makes flushing
return the write error, if one occurs, since the flush callback
is passed as the write callback.
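A minimal sketch of the behaviour described above, not the actual ObjectCacher code: the `BufferHead` class, its string states, and `finish_write` are hypothetical names chosen for illustration. It assumes a negative return value signals a write error.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BufferHead:
    # Hypothetical states: "dirty", "tx" (write in flight), "clean".
    state: str = "dirty"


def finish_write(bh: BufferHead, ret: int, on_complete: Callable[[int], None]) -> None:
    """Hypothetical write-completion handler.

    On error (ret < 0) the buffer is marked dirty again so the data is
    not lost; in both cases the return value reaches the completion,
    which is how a flush callback would see the write error.
    """
    if ret < 0:
        bh.state = "dirty"
    else:
        bh.state = "clean"
    on_complete(ret)
```

Because the flush callback is passed as the write callback, a caller waiting on a flush would receive the same `ret` value in this sketch.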
Josh Durgin [Tue, 15 May 2012 17:58:59 +0000 (10:58 -0700)]
ObjectCacher: propagate read errors to the caller
Previously the return value of a read operation was ignored. Now a
read error sets the error field, and changes the BufferHead to a new
error state. Error state BufferHeads are treated as misses so they can
be retried when requested by a user of the ObjectCacher. When _readx
is called again internally, they're treated as hits so the error can
be returned to the user.
The error value is ignored if the BufferHead is not in the error
state.
Sage Weil [Wed, 16 May 2012 22:37:34 +0000 (15:37 -0700)]
mon: fix mon removal check
Only take our absence from the monmap to mean that we were removed if we
were ever a member in the first place.
This fixes the bootstrap case:
- create temp_monmap with existing member(s) plus new guy
- ceph-mon --mkfs --monmap temp_monmap --fsid ...
- start ceph-mon
Basically, this is just using the seed monmap as a way to tell the new
daemon which ip:port to use. Specifying mon addr, public network, or
public addr would also work.
Fixes: #2436
Signed-off-by: Sage Weil <sage@inktank.com>
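The removal check in this commit can be illustrated with a small predicate. This is only a sketch under assumed names (`was_removed`, `ever_member`); the real monitor code is C++ and tracks membership differently.

```python
def was_removed(name: str, monmap_members: set, ever_member: bool) -> bool:
    """Hypothetical check: absence from the monmap counts as removal
    only if this monitor was ever a member in the first place."""
    return ever_member and name not in monmap_members
```

In the bootstrap case above, a freshly created monitor is absent from the cluster's monmap but was never a member, so the check returns False and the daemon does not conclude that it was removed.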
Josh Durgin [Wed, 16 May 2012 20:40:43 +0000 (13:40 -0700)]
ObjectCacher: only perfcount reads requested by the client
_readx is called again after each bh is read by C_RetryRead. This
resulted in the read being counted many times for the internal
caller that was just checking whether it was done yet.
Josh Durgin [Mon, 14 May 2012 18:49:49 +0000 (11:49 -0700)]
Objecter: don't throttle resent linger ops
Throttling is intended to stop the caller from submitting too many
requests, not to block requests that are being resent internally. This
prevents a deadlock when handling an osdmap: previously
handle_osd_map could block when resending linger ops due to the
throttling. This would stop the messenger's dispatch thread from
delivering any subsequent messages, so the throttle budget would never
be replenished.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
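The idea of throttling only caller-submitted requests can be sketched as follows. The `Throttle` and `Client` classes and their methods are hypothetical, and the sketch returns False where a real throttle would block the calling thread.

```python
class Throttle:
    """Hypothetical budget counter for in-flight requests."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.in_flight = 0

    def try_take(self) -> bool:
        # A real throttle would block here; the sketch just reports failure.
        if self.in_flight >= self.budget:
            return False
        self.in_flight += 1
        return True

    def put(self) -> None:
        # Called when a reply arrives and budget is returned.
        self.in_flight -= 1


class Client:
    def __init__(self, budget: int) -> None:
        self.throttle = Throttle(budget)
        self.sent = []

    def submit(self, op: str) -> bool:
        """Caller-facing path: subject to the throttle."""
        if not self.throttle.try_take():
            return False
        self.sent.append(op)
        return True

    def resend(self, op: str) -> None:
        """Internal path (e.g. on a new osdmap): never throttled, so the
        dispatch thread cannot get stuck waiting for budget that only
        it can replenish."""
        self.sent.append(op)
```

With a budget of 1, a second `submit` is refused until `put` is called, while `resend` always goes through.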
Sage Weil [Sat, 12 May 2012 21:59:55 +0000 (14:59 -0700)]
crush: pass weight vector size to map function
Pass the size of the weight vector into crush_do_rule() to ensure that we
don't access values past the end. This can happen if the caller misbehaves
and passes a weight vector that is smaller than max_devices.
Currently the monitor tries to prevent that from happening, but this will
gracefully tolerate previous bad osdmaps that got into this state. It's
also a bit more defensive.
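A simplified sketch of the bounds check this commit describes. The function name `is_out` and the choice to treat an out-of-range device as "out" are assumptions for illustration; the real check is C code, and the weight handling here is reduced to zero versus non-zero.

```python
def is_out(weights: list, item: int) -> bool:
    """Hypothetical device check that never reads past the weight vector.

    The length of `weights` plays the role of the size passed into the
    map function; a device id beyond it is treated as out (an assumption
    of this sketch) instead of indexing past the end.
    """
    if item < 0 or item >= len(weights):
        return True
    return weights[item] == 0
```

For example, with a weight vector of length 2 and a map that references device 5, the lookup is answered from the size check rather than from memory beyond the vector.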
workloadgen: forcing the user to specify a data and journal.
These default arguments, although handy when we just want to run the test,
just mess things up when we don't actually need them. If we don't specify
them on the CLI, we'll end up using the default ones, and that is just
annoying.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
workloadgen: Allow finer control over what the generator does.
Allow the user to have more control over:
- the sizes of the data being written by the operations;
- which operations are suppressed from execution;
- whether the throughput is shown;
- the periodicity of throughput output.
For the CLI options, '--help' should suffice.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Sage Weil [Sun, 6 May 2012 21:18:22 +0000 (14:18 -0700)]
osd: reset last_peering_interval on replica activate
There was a silent bug in the activate 'acks' that go from the replica back
to the primary. Prior to 86aa07d7a91ac23074e76551c3a6db3a5736cffa, we
were passing same_interval_since to the callback, which meant that
sometimes _activate_committed() would ignore it and we wouldn't update
last_epoch_started. This was mostly invisible; the next peering event would
just, in some cases, look at more past intervals than it needed to.
In 86aa07d7a91ac23074e76551c3a6db3a5736cffa we fixed this so that the check
is correct. (We noticed because now we aren't setting the pg CLEAN flag
until after last_epoch_started is updated.) That, in turn, revealed a
similar bug that we're fixing here: the replica's last_peering_reset could
be lower than the primary's, such that the activate 'ack' info is ignored.
To fix this, simply set last_peering_reset to the current epoch when the
replica activates; this will always be greater than the primary's.
Sage Weil [Sat, 5 May 2012 21:34:54 +0000 (14:34 -0700)]
objectcacher: make cache sizes explicit
Make ObjectCacher users specify the cache size for each ObjectCacher
instance. This avoids the confusing config namespace for the object
cache (client_oc_*), and also will make it possible to eventually have
cache sizes that vary between (say) RBD images.
- drop unused client_oc_max_sync_write
- add rbd_cache_max_size, max_dirty, target_dirty config values (these are
the defaults for each image)
We probably want to add librbd calls to specify the cache size on a
per-image basis? Alternatively, we should make it possible to share a
cache pool between multiple images in some explicit way.
Sage Weil [Sat, 5 May 2012 23:25:26 +0000 (16:25 -0700)]
objectcacher: flush range, set
Add ability to flush a range of an object, or a vector of ObjectExtents. Flush
any buffers that intersect the specified range, or the entire object if len==0.
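The range selection described above can be sketched as an interval-intersection test. The function `buffers_to_flush` and the `(start, length)` tuple representation of dirty buffers are hypothetical; the actual implementation works on BufferHeads.

```python
def buffers_to_flush(buffers: list, offset: int, length: int) -> list:
    """Return the buffers that intersect [offset, offset + length).

    `buffers` is a list of (start, length) tuples. A length of 0 selects
    every buffer, i.e. the entire object.
    """
    if length == 0:
        return list(buffers)
    end = offset + length
    return [(s, l) for (s, l) in buffers if s < end and s + l > offset]
```

A buffer that merely touches the range boundary (ends exactly at `offset`, or starts exactly at `offset + length`) does not intersect it and is left alone in this sketch.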
Sage Weil [Sat, 5 May 2012 18:24:57 +0000 (11:24 -0700)]
osd: do not mark pg clean until active is durable
Do not mark a PG CLEAN or set last_epoch_clean until after the PG activate
is stable on all replicas.
This effectively means that last_epoch_clean will never fall in an interval
that follows last_epoch_started's interval. It *can* be >
last_epoch_started when it falls within the same interval.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 5 May 2012 20:07:06 +0000 (13:07 -0700)]
osd: check against last_peering_reset in _activate_committed
We are checking against last_peering_reset in _activate_committed(), so we
need to pass in that value to compare against; last_peering_reset may be
greater than same_interval_since, e.g. on a replica that learns about the
PG after the initial creation epoch.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Apparently S3_put_object() and S3_get_object() need to
run on the same thread as S3_runall_request_context() (at least
per context). So we now call them in the workqueue thread.
There was a bug when doing a read with multiple threads, when
one of the threads was left behind; when it returned the compared
data string might have been cluttered by newer strings that
were longer.
Sage Weil [Fri, 4 May 2012 22:26:33 +0000 (15:26 -0700)]
librados: call safe callback on read operation
This avoids confusion for the user who isn't sure if they should wait for
complete or safe on a read aio. It also means that you can always wait
for safe for both reads and writes, which can simplify some code.
Dup the roundtrip functional tests to verify this works.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Yehuda Sadeh <yehuda.sadeh@inktank.com>
Sage Weil [Fri, 4 May 2012 20:12:58 +0000 (13:12 -0700)]
objectcacher: don't wait for write waiters; wait after dirtying
We do three things here:
- Wait for the dirty limit to drop _after_ writing into the cache. This
means that an active thread can always provide its dirty data to the
cache for potential writing without waiting (a small win). It's also
helpful later... (see below, and next commit)
- Don't wait for other waiters. If another thread is dirtying 1MB and is
waiting for it, don't wait for them too. This prevents two threads
writing 1MB at a time with a limit of 1MB from serializing: both can
dirty their 1MB and initiate a flush, and once 1/2 of that has
flushed one of them will be allowed to proceed.
- Update the flusher to add the dirty_waiting bytes to the amount to
write so that the OPs will indeed be parallel.
Sage Weil [Fri, 4 May 2012 18:05:34 +0000 (11:05 -0700)]
crush: comment and clean up checks for check_item_loc and insert_item
- drop useless cur for check_item_loc
- comment the checks we're doing so the code is understandable
- use name_exists instead of broken get_item_id != 0 check