Yehuda Sadeh [Tue, 1 May 2012 23:47:32 +0000 (16:47 -0700)]
rgw: normalize bucket/obj before updating cache
Fixes bug #2369. The problem was that sometimes we send the
notification with the un-normalized bucket/obj pair. We
should make sure that we use the canonical name before doing
any cache update.
osd: add is_unmanaged_snaps_mode() to pg_pool_t; use more consistently
Create an is_unmanaged_snaps_mode() function to parallel
is_pool_snaps_mode(), and replace all the checks directly referencing
removed_snaps or snaps with calls to these functions.
Fixes #2345.
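A minimal sketch of the two parallel predicates follows. PoolSketch is a hypothetical, trimmed-down stand-in for pg_pool_t. The exact conditions (a snap_seq counter plus a non-empty removed_snaps set) are assumptions made for illustration and may not match the real implementation.

```cpp
#include <set>

// Hypothetical stand-in for pg_pool_t; only the fields the sketch needs.
struct PoolSketch {
  unsigned snap_seq = 0;             // assumed: bumped whenever a snapshot is taken
  std::set<unsigned> removed_snaps;  // assumed: only used by unmanaged snapshots

  // Pool-managed snapshots: snapshots exist and removed_snaps is unused.
  bool is_pool_snaps_mode() const {
    return removed_snaps.empty() && snap_seq > 0;
  }

  // Unmanaged (self-managed) snapshots: removed_snaps is in use.
  bool is_unmanaged_snaps_mode() const {
    return !removed_snaps.empty() && snap_seq > 0;
  }
};

// Callers ask the pool which mode it is in instead of inspecting
// removed_snaps or snaps themselves.
bool can_add_pool_snap(const PoolSketch& p) {
  return !p.is_unmanaged_snaps_mode();
}
```

Keeping the mode test in one place means every caller agrees on what each mode means.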
Sage Weil [Mon, 30 Apr 2012 20:36:37 +0000 (13:36 -0700)]
osdmap: do not dereference NULL entity_addr_t pointer in addr accessors
These may be NULL if we expand the addr vectors but haven't ever stored an
address yet. Check for NULL and return a reference to a blank
entity_addr_t as needed.
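The accessor pattern described above can be sketched roughly as follows. Addr and AddrTable are hypothetical stand-ins for entity_addr_t and the OSDMap address vectors; the real types and member names differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for entity_addr_t.
struct Addr {
  std::string host;  // empty means "blank"
  int port = 0;
};

class AddrTable {
 public:
  // Growing the vector leaves the new slots as null pointers.
  void resize(std::size_t n) { addrs_.resize(n); }

  void set(std::size_t i, const Addr& a) {
    assert(i < addrs_.size());
    addrs_[i] = std::make_shared<Addr>(a);
  }

  // Return the stored address, or a shared blank one when the slot was
  // never filled, rather than dereferencing a null pointer.
  const Addr& get(std::size_t i) const {
    assert(i < addrs_.size());
    static const Addr blank;
    return addrs_[i] ? *addrs_[i] : blank;
  }

 private:
  std::vector<std::shared_ptr<Addr>> addrs_;
};
```

Returning a reference to one static blank object keeps the accessor signature unchanged for callers.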
Sage Weil [Sun, 29 Apr 2012 15:17:06 +0000 (08:17 -0700)]
osd: keep pgs locked during handle_osd_map dance
Currently we drop and retake locks during handle_osd_map calls to
advance_map and activate_map. Instead, take them all once, and hold them.
This avoids leaving dirty in-core state in the PG without the lock held.
This will clearly go away as soon as the map threading stuff is redone.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 29 Apr 2012 14:59:44 +0000 (07:59 -0700)]
osd: fix nested transaction in all_activated_and_committed()
all_activated_and_committed() is called from _activate_committed(), called
from an objectstore completion, and also from the state machine, which is
part of a larger transaction.
Instead, set dirty_info, and build/apply a transaction in the caller
(the completion) as needed. Fixes part of #2360.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 29 Apr 2012 05:31:23 +0000 (22:31 -0700)]
osd: dirty_info if history.merge updated anything
In proc_replica_info and proc_primary_info, we may or may not update
the pg_info_t. If we do, set dirty_info, so that it will be recorded.
Same goes for when the primary pushes out updated stats to us.
Also, do not write a purged_snaps() update directly; rely on the caller
to write out dirty info.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 29 Apr 2012 03:57:02 +0000 (20:57 -0700)]
osd: fix dirty_info check for advance/activate paths
Previously we would check and write dirty_info *without the pg lock* after
doing the advance and activate map calls. This was unlikely to race with
anything because the queues were drained, but definitely not right.
Instead, do the write in activate_map, or explicitly if activate_map is
not called (so that we record our progress after handling maps when we are
not up).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 22 Apr 2012 21:35:39 +0000 (14:35 -0700)]
log: do not set on_exit() callback for libraries
Set this up in either global_init() or common_init_finish(), both opportune
times that occur after config parsing has happened and the user has the
option to modify this behavior. The exception would be libraries like
librados, which can't use rados_conf_* to enable this. Arguably flush
functionality should be exposed through the librados API directly, instead
of futzing with on_exit().
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 28 Apr 2012 22:49:40 +0000 (15:49 -0700)]
osd: always share past_intervals
Share past intervals when starting up new replicas. This can happen via
an MOSDPGInfo or an MOSDPGLog message.
Fix up get_or_create_pg() so the past_intervals arg is required (and a ref,
like the other args). Fix doxygen comment.
Now the only time generate_past_intervals() should do any work is when
upgrading old clusters, during pg creation, and (possibly) during pg
split (when that is fully implemented).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 28 Apr 2012 14:37:15 +0000 (07:37 -0700)]
osd: fill in past intervals during advance_map
If ceph-osd is way behind, we will advance through past maps before we
mark ourselves up. This avoids the slow recalculation once we are up, and
the ensuing badness.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 28 Apr 2012 05:09:00 +0000 (22:09 -0700)]
osd: drop useless PG::fulfill_info()
There is a nice symmetry there with fulfill_log(), but it is a short
function with a single caller that mostly just forces us to copy a bunch
of data structures around unnecessarily. Drop it.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 28 Apr 2012 05:08:03 +0000 (22:08 -0700)]
osd: share past intervals with notifies
Send past_intervals along with pg_info_t on every notify. The reasoning
here is as follows:
- we already have the state in memory
- if we don't send it, and the primary doesn't have it, it will
recalculate it by reading/decoding many previous maps from disk
- for a highly-tortured cluster, I see past_intervals on the order of
~6 KB, times 600 pgs means ~2.5 MB sent for every activate_map(). for
comparison, the same cluster would need to read and decode ~1 GB of
maps to recalculate the same info.
- for healthy clusters, the data is small, and costs little.
- for unhealthy clusters, the data is large, but most useful.
In theory we could set a threshold so that we don't send it if it is
large, but allow the primary to query it explicitly. I doubt it's worth
the complexity.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 27 Apr 2012 23:01:43 +0000 (16:01 -0700)]
osd: only generate missing intervals in generate_past_intervals
We can (currently) get into a situation where we don't have the full
history back to last_epoch_clean because non-primaries record past
intervals but don't initially have the full history, resulting in a partial
recent history.
If this happens, only fill in what's missing; no need to rebuild the recent
parts too.
Sage Weil [Fri, 27 Apr 2012 21:30:17 +0000 (14:30 -0700)]
osd: fix check for whether to recalculate past_intervals
We may not recalculate all the way back to last_interval_clean due to
the oldest_map floor. Figure out what we want and could calculate before
deciding whether what we have is insufficient.
Also, print something if we discard and recalculate so it is clear what is
happening and why.
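As a rough illustration of the check, the sketch below first clamps the wanted starting epoch to the oldest_map floor and only then decides whether the stored intervals are insufficient. The function name, the parameters, and the Plan struct are hypothetical and are not taken from the OSD code.

```cpp
#include <algorithm>

struct Plan {
  unsigned start;  // earliest epoch we both want and can calculate
  bool recalc;     // true if the stored intervals do not reach back to start
};

// clean_epoch: epoch we would like history back to (assumed input).
// oldest_map:  oldest map still available on disk (the floor).
// have_any:    whether any past intervals are stored at all.
// have_from:   first epoch covered by the stored intervals.
Plan plan_past_intervals(unsigned clean_epoch, unsigned oldest_map,
                         bool have_any, unsigned have_from) {
  // We cannot calculate anything older than the oldest map we still have.
  unsigned start = std::max(clean_epoch, oldest_map);
  bool recalc = !have_any || have_from > start;
  return {start, recalc};
}
```

Under these assumptions, intervals that begin exactly at the floor count as sufficient, even though they do not reach the wanted epoch.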
Sage Weil [Fri, 27 Apr 2012 04:29:53 +0000 (21:29 -0700)]
mon: limit size of MOSDMap message sent as reply
We may send an MOSDMap as a reply to various requests, including
- a failure report
- a boot message
- a pg_temp message
- an up_thru message
In these cases, send a single MOSDMap message, but limit how big it gets.
All recipients here are osds, which are smart enough to request more maps
based on the MOSDMap::newest_map field.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
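One way to picture the capped reply is sketched below. It assumes the limit is a count of map epochs per message; the commit does not say whether the real limit is a count or a byte size. EpochRange, limit_reply, and needs_more are hypothetical names.

```cpp
#include <cassert>

struct EpochRange {
  unsigned first;   // first epoch included in the reply
  unsigned last;    // last epoch included in the reply
  unsigned newest;  // newest epoch the sender knows about
};

// Build the epoch range for one reply, capped at max_maps epochs.
EpochRange limit_reply(unsigned start, unsigned newest, unsigned max_maps) {
  assert(start <= newest && max_maps > 0);
  unsigned span = newest - start + 1;
  unsigned last = span > max_maps ? start + max_maps - 1 : newest;
  return {start, last, newest};
}

// The recipient compares what it received against newest and asks for
// more when it is still behind.
bool needs_more(const EpochRange& r) { return r.last < r.newest; }
```

For example, limit_reply(10, 500, 100) covers epochs 10 through 109 and reports 500 as the newest, so the recipient knows to request the remainder.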
Sage Weil [Sat, 28 Apr 2012 03:54:50 +0000 (20:54 -0700)]
osdmap: fix addr dedup check
Compare *every* address for a match, or else note that it is (or might be)
different. Previously, we falsely took diff==0 to mean that all addrs
were definitely equal, which was not necessarily the case.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
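The corrected comparison can be sketched as below. It uses a vector of shared string pointers as a hypothetical stand-in for the address vectors, and checks every slot before the two vectors are treated as identical.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a vector of shared entity_addr_t pointers.
using AddrVec = std::vector<std::shared_ptr<std::string>>;

// True only if the vectors have the same length and every slot matches:
// both null, or both set to equal values.
bool all_addrs_equal(const AddrVec& a, const AddrVec& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (!a[i] || !b[i]) {
      if (a[i] != b[i]) return false;  // exactly one side is null
      continue;                        // both null
    }
    if (*a[i] != *b[i]) return false;
  }
  return true;
}

// Share the previous vector only when every address matched.
void dedup_addrs(const std::shared_ptr<AddrVec>& prev,
                 std::shared_ptr<AddrVec>& cur) {
  if (prev && cur && prev != cur && all_addrs_equal(*prev, *cur))
    cur = prev;
}
```

A single mismatch anywhere in the vector is enough to keep the two copies separate.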
filestore: first lock osd mount point, next detect fs type
Fixes #2353. Problem was that there were (at least) two osd processes
that were racing for the fs detection, which triggered some errors
in the btrfs create/remove snapshot.
Sage Weil [Fri, 27 Apr 2012 04:51:23 +0000 (21:51 -0700)]
config: allow {get,set}_val on subsystem debug levels
This allows you to get and set subsystem debug levels via the
normal config access methods. Among other things, this allows librados
users to set debug levels.
Fixes: #2350
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
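The idea can be sketched with a toy config class in which keys of the form debug_<subsystem> are routed to a per-subsystem level table and everything else goes to ordinary options. ConfigSketch, its subsystem list, and the single-integer level format are all assumptions made for illustration; this is not the Ceph config API.

```cpp
#include <cstdlib>
#include <map>
#include <string>

class ConfigSketch {
 public:
  ConfigSketch() { levels_["osd"] = 0; levels_["ms"] = 0; }

  bool set_val(const std::string& key, const std::string& val) {
    std::string name;
    if (!subsys_name(key, &name)) {  // ordinary option
      opts_[key] = val;
      return true;
    }
    auto it = levels_.find(name);
    if (it == levels_.end()) return false;  // unknown subsystem
    char* end = nullptr;
    long v = std::strtol(val.c_str(), &end, 10);
    if (end == val.c_str() || *end != '\0') return false;  // not a number
    it->second = static_cast<int>(v);
    return true;
  }

  bool get_val(const std::string& key, std::string* out) const {
    std::string name;
    if (subsys_name(key, &name)) {
      auto it = levels_.find(name);
      if (it == levels_.end()) return false;
      *out = std::to_string(it->second);
      return true;
    }
    auto it = opts_.find(key);
    if (it == opts_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  // Extract "osd" from "debug_osd"; false if the key is not a debug key.
  static bool subsys_name(const std::string& key, std::string* name) {
    if (key.size() <= 6 || key.compare(0, 6, "debug_") != 0) return false;
    *name = key.substr(6);
    return true;
  }

  std::map<std::string, int> levels_;
  std::map<std::string, std::string> opts_;
};
```

Because debug levels go through the same get/set path as other options, a library user who only has the generic accessors can still adjust them.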
Sage Weil [Thu, 26 Apr 2012 18:12:11 +0000 (11:12 -0700)]
osdmap: dedup pg_temp
We only deal with the case where the entire map is identical, since the
individual items are too small to make the pointer overhead worthwhile.
Too bad. An in-memory btree-like structure would work better for this.
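Whole-map dedup might look roughly like the sketch below: the new map reuses the previous map's storage only when the two compare equal in full. PgTemp is a hypothetical simplification of the real pg_temp type, and dedup_pg_temp is an illustrative name.

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical simplification: pg id -> temporary list of OSD ids.
using PgTemp = std::map<unsigned, std::vector<int>>;

// Return prev when the two maps are identical in full, so both OSDMap
// epochs can share one copy; otherwise keep cur as it is.
std::shared_ptr<PgTemp> dedup_pg_temp(const std::shared_ptr<PgTemp>& prev,
                                      const std::shared_ptr<PgTemp>& cur) {
  if (prev && cur && *prev == *cur) return prev;
  return cur;
}
```

Sharing at whole-map granularity costs one pointer per epoch, which avoids the per-item pointer overhead the commit message mentions.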