Sage Weil [Sat, 24 Sep 2011 22:19:31 +0000 (15:19 -0700)]
osd: drop map_cache_keep_from
The purpose here was to avoid trimming cached maps prior to what we have
on disk. However, now that we have the map_bl cache, this isn't needed:
anything after that epoch will come out of that cache.
Also, it was broken anyway--the value was never read. So clean it out!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 24 Sep 2011 22:16:53 +0000 (15:16 -0700)]
osd: limit size of OSDMap cache
If we get way way behind on our maps, we may end up with a really large
OSDMap cache because we currently only trim old maps based on
oldest_last_clean, which may be way in the past. Avoid eating up gobs of
RAM by putting a ceiling on the cache size. It'll mean more disk IO in
those situations, but it also means that we'll only load up the old maps
that we actually need (not every single one).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
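A minimal sketch of the cache ceiling described above, not the actual OSD code. The class name BoundedMapCache, the max_entries parameter, and the string payload standing in for an OSDMap are all assumptions made for illustration. Once the ceiling is exceeded, the oldest epochs are evicted first; a miss is where a real caller would fall back to disk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the OSD's map cache; not Ceph's real class.
// Maps are keyed by epoch; the payload is a placeholder string.
class BoundedMapCache {
public:
  explicit BoundedMapCache(std::size_t max_entries)
    : max_entries_(max_entries) {
    assert(max_entries_ > 0);  // a zero ceiling would evict everything
  }

  // Insert a map for an epoch, then trim the oldest epochs until the
  // cache is back under its ceiling.
  void add(unsigned epoch, const std::string& map) {
    cache_[epoch] = map;
    while (cache_.size() > max_entries_)
      cache_.erase(cache_.begin());  // std::map is ordered: begin() is the oldest epoch
  }

  // Returns nullptr on a miss; a real caller would then read from disk.
  const std::string* get(unsigned epoch) const {
    auto it = cache_.find(epoch);
    return it == cache_.end() ? nullptr : &it->second;
  }

  std::size_t size() const { return cache_.size(); }

private:
  std::size_t max_entries_;
  std::map<unsigned, std::string> cache_;
};
```

With a ceiling of 2, adding epochs 1, 2 and 3 leaves only 2 and 3 cached, and a lookup of epoch 1 misses.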
Sage Weil [Sat, 24 Sep 2011 21:45:32 +0000 (14:45 -0700)]
osd: don't finish boot unless instance in map is really us
We were going BOOTING->ACTIVE as soon as we showed up in the map with the
same client_addr. Also verify that we were up_from an epoch after when
we started or rebound, to avoid the case where we rebind to the same
ports for client_addr (but maybe not others) and get caught in a rebind
loop.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
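The two-part check described above can be sketched as follows. This is an illustration under stated assumptions, not the OSD's real code: MapInstance, map_instance_is_us and the bind_epoch parameter are hypothetical names, and addresses are plain strings.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified view of our entry in the OSD map.
struct MapInstance {
  std::string client_addr;  // address the map advertises for this OSD
  unsigned up_from;         // epoch at which the map marked this OSD up
};

// True only if the map entry matches our client address AND it was marked
// up after the epoch at which we last started or rebound (bind_epoch).
bool map_instance_is_us(const MapInstance& in_map,
                        const std::string& my_client_addr,
                        unsigned bind_epoch) {
  return in_map.client_addr == my_client_addr &&
         in_map.up_from > bind_epoch;
}
```

An entry with the same client_addr but an up_from at or before bind_epoch is treated as a stale instance, so the daemon stays in BOOTING instead of going ACTIVE.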
Henry C Chang [Thu, 11 Nov 2010 03:40:39 +0000 (11:40 +0800)]
mon: remember source client address in routed requests
When we resend_routed_requests, the source client address is lost.
This may cause problems. For example, if we resend an mds beacon
(boot) to a new monitor leader, the new mdsmap will contain a new
mds entry without IP address.
To reproduce this bug:
1. deploy a cluster with 3 mons.
2. let active mds send beacon to mon0; standby mds send beacon to mon2
3. gdb attach to mon2 to make it unresponsive and make standby mds laggy.
4. gdb attach to mon0 to make it unresponsive and make active mds laggy.
5. detach mon2, then the standby mds will become active.
6. ceph mds dump -o - shows the active mds address is :/0
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 23 Sep 2011 19:53:39 +0000 (12:53 -0700)]
osd: fix race between handle_osd_ping and handle_osd_map
If handle_osd_map is in progress and handle_osd_ping doesn't have the
map lock, we can't call osdmap->get_inst() in send_still_alive(). Keep
the entity_inst_t in the failure_pending map so that we don't need to.
osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_inst(int)', in thread '0x7fda6a46b710'
osd/OSDMap.h: 477: FAILED assert(is_up(osd))
ceph version 0.24.1 (commit:e06fb657842379259826f3d9215101fc14575fbd)
1: (OSD::send_still_alive(int)+0x1b9) [0x4dd1b9]
2: (OSD::handle_osd_ping(MOSDPing*)+0x716) [0x4f6446]
3: (OSD::heartbeat_dispatch(Message*)+0x36) [0x4f6666]
4: (SimpleMessenger::dispatch_entry()+0x882) [0x46f002]
5: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x465fac]
6: (()+0x6a3a) [0x7fda7835aa3a]
7: (clone()+0x6d) [0x7fda76f7777d]
Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
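A rough sketch of the idea in this fix: store the peer's instance alongside the pending failure report, so the ping path never has to consult the OSDMap. Inst is a hypothetical stand-in for entity_inst_t, and the FailurePending class with its report/take methods is invented for illustration; the real failure_pending member may look quite different.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for entity_inst_t: who the peer is and where it is.
struct Inst {
  std::string name;
  std::string addr;
};

// Sketch of a failure_pending table that saves the peer's instance at
// report time, so later code needs no OSDMap lookup (and no map lock).
class FailurePending {
public:
  // Called where the map is safely readable: remember the instance.
  void report(int osd, const Inst& inst) { pending_[osd] = inst; }

  // Called from the ping path: fetch and clear the saved instance.
  // Returns false if no failure report is pending for this osd.
  bool take(int osd, Inst* out) {
    assert(out != nullptr);
    auto it = pending_.find(osd);
    if (it == pending_.end())
      return false;
    *out = it->second;
    pending_.erase(it);
    return true;
  }

private:
  std::map<int, Inst> pending_;
};
```

The still-alive message can then be addressed using the saved Inst, even if a concurrent map update has already marked that osd down.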
096e3b6353e5035362cffdcbd2e4a4f5572aa2ba broke this by only using
set_pool_image_name for commands that accept the snapshot
parameter. This whole undocumented format parsing should be reworked
or removed at some point.
Sage Weil [Wed, 21 Sep 2011 22:46:37 +0000 (15:46 -0700)]
osd: fix PG::copy_after vs backlog
If you call copy_after(..., 0) on a log with a backlog, you get all the
backlog entries, but no backlog flag. That's invalid. You either need
the _complete_ backlog + the flag, or no backlog entries; getting only
some of them is useless information.
Make copy_after stop when it hits the tail. Callers who need the backlog
are already checking for that and copying the whole log as appropriate.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
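To illustrate the new stopping rule, here is a simplified sketch. It assumes a log whose entries are sorted by version and treats every entry with version <= tail as backlog; Entry and Log are hypothetical stand-ins, not the real PG log types, and versions are plain integers.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical log entry; real entries carry much more than a version.
struct Entry {
  unsigned version;
};

struct Log {
  unsigned tail = 0;           // entries with version <= tail are backlog
  std::vector<Entry> entries;  // sorted by ascending version

  // Copy entries newer than v, walking back from the head, and stop at
  // the tail so that no partial backlog is ever returned.
  std::vector<Entry> copy_after(unsigned v) const {
    std::deque<Entry> out;
    for (auto it = entries.rbegin(); it != entries.rend(); ++it) {
      if (it->version <= v || it->version <= tail)
        break;
      out.push_front(*it);
    }
    return std::vector<Entry>(out.begin(), out.end());
  }
};
```

With tail = 5 and entries 1 through 8, copy_after(0) returns only versions 6, 7 and 8; the backlog entries 1 through 5 are left out.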
Sage Weil [Wed, 21 Sep 2011 17:10:54 +0000 (10:10 -0700)]
client: tear down dir when setting I_COMPLETE on empty
If the dir is supposed to be empty and we are setting I_COMPLETE, empty
it out and close it. This ensures we don't return bad results on a
subsequent readdir().
_Maybe_ related to #1509.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 19 Sep 2011 23:50:48 +0000 (16:50 -0700)]
osd: set reply version for dup requests
If we get a dup request, set the version in the reply. That means the
client knows the request was successful and committed, and they know the
version. They don't get anything else (e.g., data payload resulting from
mutations).
Sage Weil [Wed, 21 Sep 2011 00:01:38 +0000 (17:01 -0700)]
cconf: do not common_init_finish
All that common_init_finish() does is indicate that we're done initializing
and we're allowed to start up extra threads and do the wonky things that
daemons like to do (like set up the admin socket). Since cconf is just
examining the config, we don't want to do any of that.
Samuel Just [Tue, 20 Sep 2011 20:34:31 +0000 (13:34 -0700)]
OSD: return NULL when the OSD does not have the pg in lookup_lock_raw_pg
Previously, we returned NULL if the osd lacked the pool, but not if the
osd had the pool and lacked the pg. In that case, the assert in
_lookup_lock_pg would crash the osd.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
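A small sketch of the two miss cases, under the assumption that PGs are indexed first by pool and then by a per-pool id. PG, PGTable and lookup are hypothetical names for illustration, and locking is left out entirely.

```cpp
#include <cassert>
#include <map>

// Hypothetical placeholder for a placement group.
struct PG {
  int pool;
  int seed;
};

class PGTable {
public:
  void add_pool(int pool) { pools_[pool]; }  // creates an empty pool entry
  void add_pg(int pool, int seed) { pools_[pool][seed] = PG{pool, seed}; }

  // Returns nullptr both when the pool is unknown and when the pool is
  // known but this pg is not present, instead of asserting in the
  // second case.
  PG* lookup(int pool, int seed) {
    auto p = pools_.find(pool);
    if (p == pools_.end())
      return nullptr;  // we lack the pool
    auto g = p->second.find(seed);
    if (g == p->second.end())
      return nullptr;  // we have the pool but lack the pg
    return &g->second;
  }

private:
  std::map<int, std::map<int, PG>> pools_;
};
```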
Sage Weil [Tue, 20 Sep 2011 17:58:20 +0000 (10:58 -0700)]
osd: remove throttle_op_queue()
There are subtle annoying problems with throttling and requeueing, and
throttling at this particular point in the stack makes little sense
anyway. We have
- messenger queue. throttled based on total bytes/payload
- op_queue, throttled before we queue items.
There is no real value in throttling a message before checking whether it
is valid (sent to the right osd, etc.) or putting it on the op_queue,
where it will sit until a worker thread picks it up and processes it.
When we get an osd_map, for instance, we pause op_queue, requeue
everything on the op_queue for reprocessing, and do the map update, so
not having a load of messages on that queue doesn't hurt us. It just
complicates requeueing in the throttle_op_queue case, and delays the
checks for non-existent PGs or misdirected requests.
Sage Weil [Tue, 20 Sep 2011 01:23:10 +0000 (18:23 -0700)]
osd: preserve ordering when throttling races with missing/degraded requeue
When we delay an op because the op_queue is full, we can violate the op
order:
- op1 comes in, waits because object is missing
- op2 comes in, throttles on op queue
- op1 is requeued (no longer missing)
- queue drains, op2 happens
- op1 happens
To avoid this, if we delay, requeue ourselves... after whatever else is
on the queue.
Fixes: #1490
Signed-off-by: Sage Weil <sage@newdream.net>
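The reordering above can be modelled with a toy queue. This is only an illustration: the run function and its requeue_after_delay flag are hypothetical, ops are just names, and the old behaviour is assumed to be that a throttled op proceeds directly once the throttle lifts.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Returns the order in which the two ops end up being executed.
std::vector<std::string> run(bool requeue_after_delay) {
  std::deque<std::string> queue;      // ops waiting for a worker thread
  std::vector<std::string> executed;  // execution order

  // op1 arrived first but was parked because its object was missing.
  // op2 arrived next and was delayed by the throttle. While op2 waits,
  // op1's object becomes available and op1 is requeued.
  queue.push_back("op1");

  if (requeue_after_delay) {
    // Fix: the delayed op goes behind whatever is already on the queue.
    queue.push_back("op2");
  } else {
    // Assumed old behaviour: the delayed op runs as soon as the throttle
    // lifts, ahead of anything requeued in the meantime.
    executed.push_back("op2");
  }

  while (!queue.empty()) {
    executed.push_back(queue.front());
    queue.pop_front();
  }
  return executed;
}
```

Here run(false) yields op2 before op1, which is the violation, while run(true) preserves arrival order.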
Sage Weil [Mon, 19 Sep 2011 21:00:59 +0000 (14:00 -0700)]
osd: clear need_up_thru in build_prior as appropriate
The only time need_up_thru is cleared is in the Peering state AdvMap
handler, but that handler doesn't get called if prior_set_affected() is
true and we go into build_prior(). build_prior() sets need_up_thru if it's
needed, but it doesn't clear it if it's not, which means the pg never goes
active.
Reported-by: Sam Lang <samlang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
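The difference between setting the flag only when needed and assigning it unconditionally can be sketched as below. PGState and both functions are hypothetical, and the `required` argument stands in for whatever build_prior() actually computes.

```cpp
#include <cassert>

// Hypothetical slice of PG state holding only the flag in question.
struct PGState {
  bool need_up_thru = false;
};

// Behaviour before the fix: the flag is set when required but never
// cleared, so a stale 'true' survives.
void build_prior_before(PGState& pg, bool required) {
  if (required)
    pg.need_up_thru = true;
}

// Behaviour after the fix: the flag always reflects the latest result.
void build_prior_after(PGState& pg, bool required) {
  pg.need_up_thru = required;
}
```

Starting from need_up_thru == true, calling the first variant with required == false leaves the flag set, which is the stuck state described above; the second variant clears it.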