Sage Weil [Mon, 20 Aug 2012 19:33:08 +0000 (12:33 -0700)]
osd: fix requeue order of dup ops
The waiting_for_ondisk (and ack) maps get dups of ops that are in progress.
If we have a peering change in which the role does not change, we will
requeue the in-progress ops but leave these in the waiting_for_ondisk
maps, which will then trigger an assert the next time we examine that map
and find it didn't match up with what we expected.
Fix this by requeuing these on any peering reset in on_change(). This
keeps the two queues in sync.
Fixes: #2956 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 17 Aug 2012 16:02:10 +0000 (09:02 -0700)]
mds: do not return null dentry lease on getattr
Specifically, /foo may exist and a client may try to mount /foo/bar. That
GETATTR request is on #1/foo/bar, but we cannot return a null dentry on bar
because the client is not prepared to handle it and will crash in
fill_trace().
Fixes: #2959 Reported-by: Yan Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 16 Aug 2012 00:19:22 +0000 (17:19 -0700)]
osd: explicitly requeue waiting_for_map in on_change()
Since we are requeuing stuff anyway, do it all in the correct order. This
fixes a bug where take_waiters() comes along later (at activate_map time)
and puts waiting_for_map events at the front of the queue, in front of
e.g. waiting_for_missing. This breaks ordering from the client's
perspective.
The convention should be: whenever you requeue something, first requeue
everything that logically follows it.
Fixes: #2947 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
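The requeue convention above can be sketched as follows. This is a minimal Python model, not the OSD code: the queue is a plain deque, and it assumes (as the message implies) that ops in waiting_for_missing logically precede those in waiting_for_map. The helper names are hypothetical.

```python
from collections import deque


def requeue_front(op_queue, waiters):
    """Put a waiting list back at the FRONT of the queue, keeping its order."""
    # extendleft() reverses its input, so feed it the waiters reversed.
    op_queue.extendleft(reversed(waiters))
    waiters.clear()


def on_change(op_queue, waiting_for_missing, waiting_for_map):
    # Requeue the logically later list first, so that the earlier list
    # ends up in front of it once both are back on the queue.
    requeue_front(op_queue, waiting_for_map)
    requeue_front(op_queue, waiting_for_missing)


q = deque(["op5"])
on_change(q, ["op1", "op2"], ["op3", "op4"])
print(list(q))  # ['op1', 'op2', 'op3', 'op4', 'op5']
```

Requeuing in the opposite order would leave op3 and op4 ahead of op1 and op2, which is the reordering the client would observe.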
Danny Kukawka [Thu, 16 Aug 2012 10:56:58 +0000 (12:56 +0200)]
fix keyring generation for mds and osd
Fix the config keys for the OSD/MDS data dirs. As in the documentation and
elsewhere in the scripts, the keys are 'osd data'/'mds data' and not
'osd_data'.
In the MDS case: if 'mds data' doesn't exist, create it.
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Danny Kukawka [Thu, 16 Aug 2012 10:56:32 +0000 (12:56 +0200)]
fix ceph osd create help
Change ceph osd create <osd-id> to ceph osd create <uuid>, since this
is what the command is really doing.
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Yehuda Sadeh [Mon, 18 Jun 2012 20:25:44 +0000 (13:25 -0700)]
rgw: use multiple notification objects
Issue #2504. This makes us listen and notify on more than
a single object, which reduces the contention of cache
notifications.
NOTE: This change requires that every radosgw and radosgw-admin
use the same 'rgw num control oids' config value. A config value
of 0 maintains the old behavior and allows an upgraded process
to run in conjunction with an old one. Setting a value other
than 0 (or using the non-zero default) requires upgrading
and restarting all the gateways together. Failing to do so
might lead to inconsistent user and bucket metadata (which
will be resolved once the gateways are restarted).
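A minimal sketch of why all processes need the same value, assuming notifications are spread by hashing a cache key onto one of N control objects. The function name, the "notify" prefix and the CRC-based hash are illustrative assumptions, not taken from the rgw code.

```python
import zlib


def control_oid(cache_key, num_control_oids, prefix="notify"):
    """Pick the notification object for a cache key (hypothetical naming)."""
    if num_control_oids == 0:
        # Compatibility mode: a single, un-suffixed object, as before.
        return prefix
    shard = zlib.crc32(cache_key.encode("utf-8")) % num_control_oids
    return f"{prefix}.{shard}"


# Two processes only agree on the object if they use the same count.
print(control_oid("bucket:photos", 0))
print(control_oid("bucket:photos", 8))
```

In this model a gateway configured with a different count would listen on a different set of objects than the one a peer notifies on, so cache invalidations could be missed.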
caleb miles [Thu, 9 Aug 2012 20:27:21 +0000 (13:27 -0700)]
rgw_admin.cc: Allow removal of a user's buckets during user removal.
Allow the buckets, and any child objects, of a user to be deleted when the
user is deleted through radosgw-admin. In reference to feature request
2499: http://tracker.newdream.net/issues/2499.
Signed-off-by: caleb miles <caleb.miles@inktank.com>
Yehuda Sadeh [Wed, 1 Aug 2012 20:22:38 +0000 (13:22 -0700)]
rgw: fix usage trim call encoding
Fixes: #2841.
The usage trim operation was encoding the wrong op structure (usage read).
Since the structures partly overlap it somewhat worked, but the user
info wasn't encoded.
Encode the user as well, and reset version compatibility.
This change affects the command interface and makes older versions of
radosgw-admin usage trim incompatible. Use of the old
radosgw-admin usage trim should be avoided, as it may
remove more data than requested. In any case, upgraded
server code will not handle an old client's trim requests.
Yehuda Sadeh [Thu, 2 Aug 2012 18:13:05 +0000 (11:13 -0700)]
rgw: complete multipart upload can handle chunked encoding
Fixes: #2878
We now allow complete multipart upload to use chunked encoding
when sending request data. With chunked encoding the HTTP_LENGTH
header is not required.
Yehuda Sadeh [Wed, 1 Aug 2012 18:19:32 +0000 (11:19 -0700)]
rgw_xml: xml_handle_data() appends data string
Fixes: #2879.
xml_handle_data() now appends data to the object instead of just
replacing it. Parsed data can arrive in pieces, specifically
when data is escaped.
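The effect can be illustrated with Python's expat binding, used here only as a stand-in for the C handler. The parse_text() helper and its append flag are hypothetical; the point is that a character-data handler may be called more than once per element, so it has to accumulate.

```python
import xml.parsers.expat


def parse_text(document, append):
    """Return the character data seen while parsing a one-element document."""
    state = {"data": ""}

    def handle_data(chunk):
        if append:
            state["data"] += chunk   # accumulate every piece
        else:
            state["data"] = chunk    # old behavior: keep only the last piece

    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.Parse(document, True)
    return state["data"]


doc = "<Key>a&amp;b</Key>"
print(parse_text(doc, append=True))   # a&b
print(parse_text(doc, append=False))  # may be only a trailing piece
```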
Sage Weil [Wed, 8 Aug 2012 15:09:59 +0000 (08:09 -0700)]
buffer: make release() private
This should only be called by ~ptr or when we are replacing the current
target with something new. It is not suitable for external consumption
because it doesn't reset length and offset.
caleb miles [Fri, 27 Jul 2012 18:26:21 +0000 (11:26 -0700)]
rgw_admin.cc: Disallow addition of S3 keys with subuser creation
Fixes: #1855
It is no longer possible to create a subuser and new S3 key associated
with that user through the radosgw-admin utility. In reference to Bug 1855
http://tracker.newdream.net/issues/1855.
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Signed-off-by: caleb miles <caleb.miles@inktank.com>
Tommi Virtanen [Thu, 2 Aug 2012 15:27:55 +0000 (08:27 -0700)]
doc: Simplify submodules explanation.
``git clone --recursive`` does ``init`` & ``update`` for us. Also
avoids incorrect language; there never were submodules called ``init``
and ``update``.
Sage Weil [Tue, 31 Jul 2012 21:01:57 +0000 (14:01 -0700)]
osd: peering: detect when log source osd goes down
The Peering state has a generic check based on the prior set osds that
will restart peering if one of them goes down (or one of the interesting
down ones comes up). The GetLog state, however, can pull the log from
a peer that is not in the prior set if it got a notify from them (e.g., an
osd in an old interval that was down when the prior set was calculated).
If that osd goes down, we don't detect it and will block forever.
Fix by adding a simple check in GetLog for the newest_update_osd going
down.
(BTW GetMissing does not suffer from this problem because
peer_missing_requested is a subset of the prior set, so the Peering check
is sufficient.)
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
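A rough model of the two checks, assuming the new map is reduced to a dict of osd id to up/down. The function and parameter names are hypothetical, and the "interesting down osd comes up" half of the Peering check is left out for brevity.

```python
def should_restart_peering(osd_up, prior_set, newest_update_osd=None):
    """Return True if a map change should restart peering.

    osd_up maps osd id -> bool for the new map (hypothetical representation).
    """
    # Generic Peering check: any prior-set osd went down.
    if any(not osd_up.get(osd, False) for osd in prior_set):
        return True
    # Extra GetLog check: the log source may be outside the prior set.
    if newest_update_osd is not None and not osd_up.get(newest_update_osd, False):
        return True
    return False


osd_up = {1: True, 2: True, 7: False}
print(should_restart_peering(osd_up, {1, 2}))                       # False
print(should_restart_peering(osd_up, {1, 2}, newest_update_osd=7))  # True
```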
Samuel Just [Mon, 30 Jul 2012 20:43:51 +0000 (13:43 -0700)]
PG,ReplicatedPG: clarify scrub state clearing
scrub_clear_state takes care of clearing the SCRUB and REPAIR
flags. Thus, PG::scrub() needn't clear them again since
any change that would have caused that if block to execute
would have triggered ReplicatedPG::on_change(), which also
clears the scrub reservations.
Sage Weil [Sat, 28 Jul 2012 16:19:03 +0000 (09:19 -0700)]
osd: initialize send_notify on pg load
When the PG is loaded, we need to set send_notify if we are not the
primary. Otherwise, if the PG does not go through
start_peering_interval() or experience a role change, we will never set
the flag and never tell the primary that we exist. This can cause problems
for example if we have unfound objects that the primary needs, although
I'm sure there are other bad implications as well.
Fixes: #2866 Signed-off-by: Sage Weil <sage@inktank.com>
test: test_keyvaluedb_iterators: Test KeyValueDB implementations iterators
This set of tests focuses on testing the expected behavior of LevelDBStore's
and KeyValueDBMemory's iterators.
We test a total of six use cases, each with several test units, run
against both the LevelDBStore and the in-memory mock (48 test units in
total, plus two disabled by default):
* Removing keys:
- Using both the whole-space iterator and the whole-space snapshot
iterator
- Tests key removal while iterating the store, either by prefix or by
removing specific (prefix,key) pairs
* Setting keys:
- Using both the whole-space iterator and the whole-space snapshot
iterator
- Tests key insertion while iterating the store
- Tests value update while iterating the store
- This use case has two disabled tests: one for setting keys, the other
for updating values, both on LevelDBStore and using the whole-space
iterator. They are disabled because they fail there, unlike with the
in-memory mock implementation: leveldb implicitly creates
an iterator that reads from a snapshot instead of directly from
the underlying store.
* Using Upper/Lower Bounds:
- Using the whole-space iterator (we don't modify the store's state,
so there is no need to also test the whole-space snapshot iterator)
- Tests upper/lower bounds when the key, the prefix or both are empty
- Tests upper/lower bounds when both the key and the prefix are set
* Seeking:
- Using the whole-space iterator (we don't modify the store's state,
so there is no need to also test the whole-space snapshot iterator)
- Tests seeking to first and to last
- Tests seeking to first and to last using a prefix
* Key-Space Iteration:
- Using the whole-space iterator (we don't modify the store's state,
so there is no need to also test the whole-space snapshot iterator)
- Tests forward and backward iteration over the key-space
* Empty Store:
- Using the whole-space iterator (we don't modify the store's state,
so there is no need to also test the whole-space snapshot iterator)
- Tests seeking and using bounds functions when the store is empty
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Create a set of functions, to be implemented by derivative classes of
KeyValueDB, responsible for returning an iterator with strong
read-consistency guarantees. How this iterator is implemented, or what
backs it, is implementation-specific, but it must guarantee that
all reads made using this iterator behave as if no writes had been made
to the store since the iterator was created.
For instance, LevelDBStore will back this iterator with a leveldb Snapshot,
while KeyValueDBMemory will perform a copy of its in-memory map.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
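The in-memory variant of that guarantee can be sketched like this. MemStore and its methods are hypothetical stand-ins, not the KeyValueDBMemory interface; the only point is that the iterator reads from a copy taken at creation time.

```python
class MemStore:
    """Toy in-memory store keyed by (prefix, key) pairs; names hypothetical."""

    def __init__(self):
        self._data = {}

    def set(self, prefix, key, value):
        self._data[(prefix, key)] = value

    def rm(self, prefix, key):
        self._data.pop((prefix, key), None)

    def snapshot_iterator(self):
        # Copy the map now, so later writes cannot affect what we read.
        frozen = dict(self._data)
        return iter(sorted(frozen.items()))


store = MemStore()
store.set("p", "a", 1)
it = store.snapshot_iterator()
store.set("p", "b", 2)   # not visible through it
store.rm("p", "a")       # not visible through it either
print(list(it))          # [(('p', 'a'), 1)]
```

A full copy is the simplest way to get the guarantee and is cheap enough for a test mock; a real backend would presumably rely on its own snapshot mechanism instead, as the message says LevelDBStore does.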
os: KeyValueDB: re-implement (prefix) iter in terms of whole-space iter
In-a-nutshell-version: Create a whole-space iterator interface, and
implement the already existing, prefix-based iterator in terms of the
new whole-space iterator.
This patch introduces a significant change to the architecture of
KeyValueDB's iterator, although its interface remains the same.
Before this patch, KeyValueDB simply defined an interface for a
prefix-based iterator, to be implemented by derivative classes. Being
constrained by a prefix-based approach to iterate over the store only makes
sense when we know which prefixes we want to iterate over, but for that we
must know about the prefixes beforehand. This approach didn't work when one
wanted to iterate over the whole key space, without any previous awareness
about the keys and their prefixes.
This patch introduces a new interface for a whole-space iterator, to be
implemented by derivative classes, which is prefix-independent. We also
define an abstract function to obtain this iterator, which must also be
implemented by the derivative class. With this interface in place, we are
then able to implement a prefix-dependent iterator in terms of the
whole-space iterator, which will be offered by the KeyValueDB class itself.
Furthermore, we implement these changes on LevelDBStore and KeyValueDBMemory,
the in-memory mock store, which leads to significant changes to both:
* LevelDBStore:
- Substitute the previously existing LevelDBIteratorImpl, which
followed a prefix-based iteration, for
LevelDBWholeSpaceIteratorImpl, which now iterates over the whole
key space of the store;
* KeyValueDBMemory:
- Substitute the previously existing MemIterator, which followed a
prefix-based iteration, for WholeSpaceMemIterator, which now
iterates over the whole key space of the in-memory mock store;
- Change the in-memory mock store data structure. Previously, we
used a map-of-maps, mapping prefixes to a key/value map; now we
keep a single map, mapping (prefix,key) pairs to values.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
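The layering described above can be sketched roughly as follows, using the single (prefix,key) map. Class and method names are hypothetical, and the sorted key list with bisect is just a stand-in for whatever ordered structure a backend provides.

```python
import bisect


class WholeSpaceIterator:
    """Walks all (prefix, key) pairs in sorted order; a hypothetical stand-in."""

    def __init__(self, data):
        self._data = data
        self._keys = sorted(data)   # sorted (prefix, key) pairs
        self._pos = 0

    def lower_bound(self, prefix, key):
        # Position at the first entry >= (prefix, key).
        self._pos = bisect.bisect_left(self._keys, (prefix, key))

    def valid(self):
        return self._pos < len(self._keys)

    def next(self):
        self._pos += 1

    def raw_key(self):
        return self._keys[self._pos]

    def value(self):
        return self._data[self.raw_key()]


class PrefixIterator:
    """Prefix-based iterator built only on the whole-space interface."""

    def __init__(self, whole, prefix):
        self._whole = whole
        self._prefix = prefix
        self._whole.lower_bound(prefix, "")   # seek to the prefix's first key

    def valid(self):
        return self._whole.valid() and self._whole.raw_key()[0] == self._prefix

    def next(self):
        self._whole.next()

    def key(self):
        return self._whole.raw_key()[1]

    def value(self):
        return self._whole.value()


def keys_with_prefix(data, prefix):
    it = PrefixIterator(WholeSpaceIterator(data), prefix)
    out = []
    while it.valid():
        out.append((it.key(), it.value()))
        it.next()
    return out


data = {("a", "x"): 1, ("b", "k1"): 2, ("b", "k2"): 3, ("c", "y"): 4}
print(keys_with_prefix(data, "b"))  # [('k1', 2), ('k2', 3)]
```

Only WholeSpaceIterator touches the backing data here, so in this model a backend needs to supply just that one class to get prefix iteration as well.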
test: workloadgen: Don't linearly iterate over a map to obtain a collection
We were iterating over the collections map a certain number of times in
order to obtain the collection at that position. To avoid this kind of
behavior in a function that may be called a large number of times, and
that may iterate over a rather large map, we now keep the collection ids
in a vector. To obtain the collection at position X, we
simply look up the collection id at position X of the vector, and then
obtain the collection from the map using that collection id.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
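A minimal sketch of the two lookups, with a Python dict and list standing in for the map and the vector. The Collections class and its method names are hypothetical; in this sketch the dict's insertion order happens to match the id list, which keeps the two lookups comparable.

```python
import itertools


class Collections:
    def __init__(self):
        self._by_id = {}   # collection id -> collection
        self._ids = []     # collection ids, one per map entry

    def add(self, coll_id, coll):
        if coll_id not in self._by_id:
            self._ids.append(coll_id)
        self._by_id[coll_id] = coll

    def get_at_slow(self, pos):
        # Old approach: step through the map until we reach position pos.
        return next(itertools.islice(self._by_id.values(), pos, None))

    def get_at(self, pos):
        # New approach: index the id vector, then do a single map lookup.
        return self._by_id[self._ids[pos]]


colls = Collections()
for i in range(5):
    colls.add(f"coll_{i}", f"collection #{i}")
print(colls.get_at(3))  # collection #3
```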
Sage Weil [Fri, 27 Jul 2012 23:03:26 +0000 (16:03 -0700)]
osd: peering: make Incomplete a Peering substate
This allows us to still catch changes in the prior set that would affect
our conclusions (that we are incomplete) and, when they happen, restart
peering.
Consider:
- calc prior set, osd A is down
- query everyone else, no good info
- set down, go to Incomplete (previously WaitActingChange) state.
- osd A comes back up (we do nothing)
- osd A sends notify message with good info (we ignore)
By making this a Peering substate, we catch the Peering AdvMap reaction,
which will notice a prior set down osd is now up and move to Reset.
Fixes: #2860 Signed-off-by: Sage Weil <sage@inktank.com>
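The scenario above can be modelled with plain class inheritance standing in for state nesting. This is only an illustration under that assumption: the method name, the string return values and the prior_down set are hypothetical.

```python
class Peering:
    """Parent state: reacts to map advances on behalf of its substates."""

    def __init__(self, prior_down):
        self.prior_down = set(prior_down)   # prior-set osds that were down

    def on_adv_map(self, up_osds):
        # A prior-set osd that was down is now up: start peering over.
        if self.prior_down & set(up_osds):
            return "Reset"
        return type(self).__name__


class Incomplete(Peering):
    """Substate: inherits on_adv_map() instead of ignoring map changes."""


state = Incomplete(prior_down=["A"])
print(state.on_adv_map(["B"]))        # Incomplete
print(state.on_adv_map(["A", "B"]))   # Reset
```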
Sage Weil [Fri, 27 Jul 2012 22:39:40 +0000 (15:39 -0700)]
osd: peering: move to Incomplete when.. incomplete
PG::choose_acting() may return false and *not* request an acting set change
if it can't find any suitable peers with enough info to recover. In that
case, we should move to Incomplete, not WaitActingChange, just like we do
a bit lower in GetLog() if we have non-contiguous logs. The state name is
more accurate, and this is also needed to fix bug #2860.
Sage Weil [Mon, 23 Jul 2012 21:41:17 +0000 (14:41 -0700)]
objecter: fix mon command resends
The monitor session is lossy. Send these when the op is initiated, or
when we reconnect. The timeout/cutoff was preventing ops from getting
resent if there was an ill-timed mon reset.
Backport: testing, stable/argonaut Signed-off-by: Sage Weil <sage@inktank.com>
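A toy model of the resend rule, assuming commands are tracked by a transaction id until acked. MonCommandSender and its methods are hypothetical names, not the Objecter interface.

```python
class MonCommandSender:
    """Resends unacked commands over a lossy session (names hypothetical)."""

    def __init__(self, send):
        self._send = send      # callable that transmits one command
        self._pending = {}     # tid -> command, kept until acked
        self._next_tid = 0

    def submit(self, command):
        self._next_tid += 1
        tid = self._next_tid
        self._pending[tid] = command
        self._send(tid, command)   # send when the op is initiated
        return tid

    def on_reconnect(self):
        # The old session may have dropped anything in flight: resend it
        # all, with no timeout or cutoff deciding which ops qualify.
        for tid, command in sorted(self._pending.items()):
            self._send(tid, command)

    def on_ack(self, tid):
        self._pending.pop(tid, None)


sent = []
sender = MonCommandSender(lambda tid, cmd: sent.append((tid, cmd)))
first = sender.submit("osd pool create")
second = sender.submit("osd pool delete")
sender.on_ack(first)
sender.on_reconnect()
print(sent)
```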
Sage Weil [Thu, 26 Jul 2012 23:35:00 +0000 (16:35 -0700)]
osd: fixing sharing of past_intervals on backfill restart
We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne(). However, the backfill code was updating the
peer info (history) in the block preceding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.
Fix by checking dne() prior to the backfill block. We still need to fill
in the message later because it isn't yet instantiated.
Fixes: #2849 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
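The ordering problem can be reduced to a few lines. This is a deliberately simplified model: the peer info is a dict with a hypothetical "dne" flag, and the backfill block is represented only by its side effect of making the info look instantiated.

```python
def activate_peer_buggy(peer_info, needs_backfill):
    """Old order: the backfill block runs before the dne() check."""
    if needs_backfill:
        peer_info["dne"] = False   # stand-in for updating the peer history
    return peer_info["dne"]        # always False after backfill setup


def activate_peer(peer_info, needs_backfill):
    """Fixed order: remember dne() first, then run the backfill block."""
    share_past_intervals = peer_info["dne"]
    if needs_backfill:
        peer_info["dne"] = False
    return share_past_intervals


print(activate_peer_buggy({"dne": True}, needs_backfill=True))  # False
print(activate_peer({"dne": True}, needs_backfill=True))        # True
```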
Sage Weil [Fri, 27 Jul 2012 04:55:00 +0000 (21:55 -0700)]
filestore: check for EIO in read path
Check for EIO in read methods and helpers. Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.
The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
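A small sketch of putting the check in one low-level helper, using C-style negative errno return values. check_eio(), lfn_read() and the fail_on_eio flag are hypothetical names for illustration, and the Python exception stands in for an assert/crash.

```python
import errno


def check_eio(result, fail_on_eio=True):
    """Fail hard on -EIO; pass any other result through unchanged."""
    if fail_on_eio and result == -errno.EIO:
        raise RuntimeError("unexpected EIO from the underlying fs")
    return result


def lfn_read(read_fn, *args):
    # Low-level helper: checking here means callers need not repeat it.
    return check_eio(read_fn(*args))


print(lfn_read(lambda: 4096))            # 4096
print(lfn_read(lambda: -errno.ENOENT))   # other errors are passed up
```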
By default we will assert/fail/crash on EIO from the underlying fs. We
already do this in the write path, but not the read path, or in various
internal infrastructure.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 25 Jul 2012 17:57:35 +0000 (10:57 -0700)]
osd: generate past intervals in parallel on boot
Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while. This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.
On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range. The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.
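The boot-time loop can be sketched as follows, assuming each pg needs a contiguous range of map epochs and the union is taken as one covering range. The function name, the load_map/update_pg callbacks and the pg ids are hypothetical; the per-pg state kept in pistate is not modelled.

```python
def generate_past_intervals(pg_ranges, load_map, update_pg):
    """pg_ranges maps pg id -> (first_epoch, last_epoch) that pg still needs."""
    if not pg_ranges:
        return
    # Union of the needed ranges, taken here as a single covering range.
    start = min(first for first, _ in pg_ranges.values())
    end = max(last for _, last in pg_ranges.values())
    for epoch in range(start, end + 1):
        osdmap = load_map(epoch)   # each map is loaded once for all pgs
        for pgid, (first, last) in pg_ranges.items():
            if first <= epoch <= last:
                update_pg(pgid, epoch, osdmap)


loaded = []
generate_past_intervals(
    {"1.0": (3, 5), "1.1": (4, 7)},
    load_map=lambda epoch: loaded.append(epoch) or epoch,
    update_pg=lambda pgid, epoch, osdmap: None,
)
print(loaded)  # [3, 4, 5, 6, 7]
```

Walking the map range once for all pgs, rather than once per pg, is what makes this "in parallel": overlapping epochs are loaded a single time.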