Sage Weil [Wed, 16 Mar 2011 05:18:45 +0000 (22:18 -0700)]
osd: only update last_epoch_started after all replicas commit peering results
The PG info.history.last_epoch_started is important because it bounds how
far back in time we think we need to look in order to fully recover the
contents of the PG. That's because every replica commits the PG peering
result (the info and pg log) when it activates.
In order for this to work properly, we can only advance last_epoch_started
_after_ the peer results are stable on disk on all replicas. Otherwise a
poorly timed failure (or set of failures) could lose the PG peer results
and we wouldn't go back far enough in time to find them.
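A minimal sketch of the rule described above. PeeringTracker and its
members are hypothetical names for illustration, not code from the tree:
the primary remembers which OSDs still have to commit the peering result
and only advances last_epoch_started once that set is empty.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical illustration only; these names are not from the Ceph tree.
using epoch_t = uint32_t;

struct PeeringTracker {
  epoch_t last_epoch_started = 0;   // advanced only after everyone commits
  epoch_t activating_epoch = 0;     // epoch of the peering result in flight
  std::set<int> waiting_on_commit;  // OSD ids that have not acked yet

  // Begin activation: every acting OSD (ourselves included) must commit.
  void start_activation(epoch_t e, const std::set<int>& acting) {
    assert(!acting.empty());
    activating_epoch = e;
    waiting_on_commit = acting;
  }

  // Called when an OSD reports that its info and pg log are stable on disk.
  // Returns true if this ack caused last_epoch_started to advance.
  bool on_commit_ack(int osd) {
    if (waiting_on_commit.erase(osd) == 0)
      return false;  // unknown or duplicate ack
    if (!waiting_on_commit.empty())
      return false;  // still waiting on at least one replica
    last_epoch_started = activating_epoch;
    return true;
  }
};
```

If any replica fails before acking, last_epoch_started keeps its old
value, so a later peering pass still looks far enough back in time.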
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 20 Nov 2010 22:37:13 +0000 (14:37 -0800)]
filestore: adjust op_queue throttle max during fs commit
The underlying FS (btrfs at least) will block writes for a period while it
is doing a commit. If an OSD workload is write limited, we should raise
the op_queue max (operations that are queued to be applied to disk) during
the commit period.
For example, for a normally journal throughput limited (writeahead mode)
workload:
- journal queue throttle normally limits things.
- sync starts
- journaled items moving to op_queue soon fill it up to op_queue max
- all writes stop
- sync completes
- op_queue drains, new writes come in again
- journal queue throttle fills up, again starts limiting tput
For an fs throughput limited workload (writeahead):
- kernel buffer cache hits dirty limit
- op_queue throttle limits tput
- sync starts
- opq stalls, new writes stall on throttler
- sync completes
- opq drains (quickly: kernel has no dirty pages)
- new writes flood in
- etc.
(Actually this isn't super realistic, because hitting the kernel dirty
limit will do all sorts of other weird things with userland memory
allocations.)
In both cases, the commit phase blocks up the op queue, and raising the
limit temporarily will keep things flowing. This should be ok because the
disks are still busy during this period; they're just flushing dirty
data and metadata. Once the sync completes the opq will quickly dump dirty
data into the kernel page cache and "catch up".
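A rough sketch of the idea, assuming a hypothetical OpQueueThrottle class
with a fixed amount of extra headroom during the commit (the class, its
method names, and the way the extra amount is chosen are all assumptions,
not the FileStore code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; not the FileStore implementation.
class OpQueueThrottle {
  uint64_t base_max_;      // limit outside of a commit
  uint64_t commit_extra_;  // extra headroom while the fs is committing
  uint64_t queued_ = 0;
  bool committing_ = false;

public:
  OpQueueThrottle(uint64_t base_max, uint64_t commit_extra)
    : base_max_(base_max), commit_extra_(commit_extra) {
    assert(base_max > 0);
  }

  // The effective limit is raised only for the duration of the commit.
  uint64_t current_max() const {
    return committing_ ? base_max_ + commit_extra_ : base_max_;
  }

  void commit_start()  { committing_ = true; }
  void commit_finish() { committing_ = false; }

  // Try to queue one op; false means the caller has to wait.
  bool try_queue() {
    if (queued_ >= current_max())
      return false;
    ++queued_;
    return true;
  }

  // One queued op has been applied to disk.
  void op_applied() {
    assert(queued_ > 0);
    --queued_;
  }

  uint64_t queued() const { return queued_; }
};
```

After commit_finish() the queue may briefly hold more than the base max;
new ops are refused until it drains back under the normal limit.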
Since cfuse usually runs as a nonprivileged user, its defaults must be a
little different from those of the other daemons. Add a flag to
common_init which can be used to set unprivileged daemon defaults.
SimpleMessenger::start() now just takes a boolean telling it whether to
daemonize. It doesn't need to check global variables or other arguments;
it just daemonizes if you tell it to; otherwise not.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Thu, 10 Mar 2011 23:37:33 +0000 (15:37 -0800)]
ReplicatedPG,OSD: Track which osds we are pulling from
Currently, a PG waiting on a pull from a dead OSD cannot continue
recovery. ReplicatedPG::pull now tracks open pulls by peer in
rec_from_peer (map<int, set<sobject_t> >).
OSD::advance_map now calls check_recovery_op_pulls to allow the PG to
reset pulls from failed peers.
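A simplified sketch of the bookkeeping, using std::string as a stand-in
for sobject_t. The PullTracker type and its method names are hypothetical;
only rec_from_peer and its map-of-sets shape come from the description
above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration only; std::string stands in for sobject_t.
using object_id = std::string;

struct PullTracker {
  // Open pulls, keyed by the OSD we are pulling from.
  std::map<int, std::set<object_id>> rec_from_peer;

  void start_pull(int peer, const object_id& oid) {
    assert(peer >= 0);
    rec_from_peer[peer].insert(oid);
  }

  void finish_pull(int peer, const object_id& oid) {
    auto it = rec_from_peer.find(peer);
    if (it == rec_from_peer.end())
      return;
    it->second.erase(oid);
    if (it->second.empty())
      rec_from_peer.erase(it);
  }

  // On a map change: forget pulls from peers that are no longer up and
  // return the affected objects so recovery can pull them elsewhere.
  std::vector<object_id> check_pulls(const std::set<int>& up_osds) {
    std::vector<object_id> to_retry;
    for (auto it = rec_from_peer.begin(); it != rec_from_peer.end(); ) {
      if (up_osds.count(it->first)) {
        ++it;
        continue;
      }
      to_retry.insert(to_retry.end(), it->second.begin(), it->second.end());
      it = rec_from_peer.erase(it);
    }
    return to_retry;
  }
};
```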
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
size_t is usually 32-bit on 32-bit architectures and 64-bit on 64-bit ones.
On the other hand, we want our offsets and lengths for librados and
librbd to be 64 bit everywhere. So we need to use uint64_t for offsets
and lengths.
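A small illustration of the difference. Extent and range_fits_in_size_t
are hypothetical names, not part of the librados or librbd API; the point
is only that a uint64_t offset has the same width on every platform while
size_t does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical illustration only; not a librados or librbd type.
struct Extent {
  uint64_t off;
  uint64_t len;
  uint64_t end() const {
    assert(len <= std::numeric_limits<uint64_t>::max() - off);  // no wrap
    return off + len;
  }
};
static_assert(sizeof(Extent::off) == 8, "offsets are 64 bits everywhere");

// Would this range survive being passed through size_t on this platform?
inline bool range_fits_in_size_t(uint64_t off, uint64_t len) {
  const uint64_t max = std::numeric_limits<std::size_t>::max();
  return off <= max && len <= max - off;
}
```

On a 32-bit build range_fits_in_size_t(0, 1ULL << 32) is false, which is
the case a size_t-based interface would silently truncate.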
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 17:23:06 +0000 (09:23 -0800)]
osd: fix osdmap scanning on pg creation
On PG creation we were scanning the complete history of all osdmaps ever.
Fix initialization of PG::Info::History epoch_created and same_*_since
fields in the base (creation) case to make this work the way it was
supposed to.
Reported-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Tommi Virtanen [Fri, 11 Mar 2011 18:28:58 +0000 (10:28 -0800)]
auth: Allow using NSS as crypto library.
Added new configure flag --with-nss that enables this. NSS is also
automatically used if it is available and CryptoPP is not; use
--without-nss to explicitly forbid this.
No change on rgw crypto yet; rgw won't build without CryptoPP for now.
NSS initialization is in a static constructor for now. All it does is
set some values on in-memory data structures, so as long as no (other)
static constructor tries to use it, everything should just work. While
this could be moved to common_init, there are several other context
initialization steps with NSS, and a later refactoring to share the
results of these can just include NSS init as its first operation, at
practically no cost.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 20:45:16 +0000 (12:45 -0800)]
osd: wait for handle_osd_map transaction ondisk without doing a full sync
Doing a full sync (forcing a btrfs transaction etc) was just wrong here.
All we (might) care about is whether our Objectstore::Transaction is
stable (in journal or fs) or not.
We are still waiting for those operations to flush to the fs (to be
readable). That may not be necessary either, but shouldn't have a big
performance impact.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 20:30:35 +0000 (12:30 -0800)]
osd: avoid setting up_thru on new PGs
This accepts the possibility of peering blockage, if the OSDs in the
first interval (after pg creation) go down and stay down, in exchange for
avoiding two osdmap updates for every pg creation. I think this is
reasonable given that:
- If the pg did go active, then it did go RW, so assuming as much changes
nothing.
- If the pg did not go active, then it is empty, and there is no data
lost.
To do this:
- When peering during the first ever interval, don't bother setting
up_thru.
- Always mark that interval maybe_went_rw, even though up_thru isn't set
as such.
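The two rules can be sketched as follows. This is a deliberately
simplified model: the function names are hypothetical, "first interval"
is assumed to mean an interval that starts at the pg's creation epoch,
and the real peering code checks more conditions than up_thru alone.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified illustration only; not the OSD peering code.
using epoch_t = uint32_t;

// Does the primary need to ask the monitor to bump its up_thru?
inline bool need_up_thru_update(epoch_t pg_created,
                                epoch_t interval_start,
                                epoch_t current_up_thru) {
  assert(interval_start >= pg_created);
  if (interval_start == pg_created)
    return false;  // first ever interval: don't bother
  return current_up_thru < interval_start;
}

// When looking back at a past interval, could the pg have gone RW in it?
inline bool interval_maybe_went_rw(epoch_t pg_created,
                                   epoch_t interval_start,
                                   epoch_t primary_up_thru) {
  assert(interval_start >= pg_created);
  if (interval_start == pg_created)
    return true;   // first interval: always assume it may have gone RW
  return primary_up_thru >= interval_start;
}
```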
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Tommi Virtanen [Fri, 11 Mar 2011 00:37:11 +0000 (16:37 -0800)]
clitest: Fix tests after osdmaptool --clobber bugfix.
Commit 5c8146b55dbd60bdfa47b53b93f2769f7d0524dc fixed clobber;
adjust the clitests to match. Reordered the test logic so that the "fsid
does not change" check covers a run without --clobber.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Greg Farnum [Thu, 10 Mar 2011 22:06:16 +0000 (14:06 -0800)]
rados: Add "stat" option, and fix "put" to work on larger block sizes.
We didn't have a stat option, now we do.
Previously, "put" allocated its read space on the stack. That meant
the max block size was a little under 8MB, or you got a segfault! Now
it's on the heap, and you can set it as you like.
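A minimal sketch of the pattern, not the rados tool's actual code:
copy_in_blocks is a hypothetical helper that reads a stream in
caller-sized chunks through a heap-allocated buffer, so the block size is
not bounded by the stack limit.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical illustration only; not the rados tool's "put" code.
// Reads 'in' in chunks of block_size bytes and hands each chunk to
// sink(data, length). Returns the total number of bytes read.
template <typename Sink>
std::size_t copy_in_blocks(std::istream& in, std::size_t block_size,
                           Sink sink) {
  assert(block_size > 0);
  std::vector<char> buf(block_size);  // on the heap, any size is fine
  std::size_t total = 0;
  while (in) {
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    const std::size_t n = static_cast<std::size_t>(in.gcount());
    if (n == 0)
      break;
    sink(buf.data(), n);
    total += n;
  }
  return total;
}
```

For a quick check, feed it a std::istringstream and append each chunk to
a std::string in the sink.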
Sage Weil [Thu, 10 Mar 2011 21:06:08 +0000 (13:06 -0800)]
filestore: assert on ENOTEMPTY
ENOTEMPTY implies rmdir failed due to stray crap in the directory. We
should fail now, instead of later when we restart cosd and have stray
data sitting around.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 19:20:54 +0000 (11:20 -0800)]
filer: set RMW bit on probe
Setting the RMW bits on the probe stat call will make the OSD wait for
pending writes on the object to flush to disk. This was a problem for MDS
takeover: the old instance had writes mid-commit on the OSD when the
journal probe/stat came in, and it didn't "see" the size change because
the PG was in 'delayed' mode and the old MDS's write hadn't been queued
for disk locally. Simply claiming we are a RMW op will force the PG
into rmw mode so that we know the prior write will be visible. This is
just fine for the metadata workload.
Fixes: #805
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 18:17:24 +0000 (10:17 -0800)]
osd: fix peer no missing optimization
This shortcut was broken: we need to populate peer_missing with missing
objects in terms of the master log, not the peer's log (which may be old
or even divergent). This shortcut _only_ makes sense when the peer has
no missing in terms of a log that is perfectly up to date (i.e. matches
our last_update).
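The condition can be sketched like this. Version, PeerInfo and
can_assume_no_missing are hypothetical stand-ins for illustration; the
only point is that "peer reports nothing missing" is not enough unless
the peer's log head also matches ours.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; not the PG peering code.
struct Version {
  uint32_t epoch;
  uint64_t version;
  bool operator==(const Version& o) const {
    return epoch == o.epoch && version == o.version;
  }
};

struct PeerInfo {
  Version last_update;  // head of the peer's own log
  bool has_missing;     // peer's missing set, relative to its own log
};

// The shortcut is valid only when the peer's log is perfectly up to date
// with the master log; otherwise peer_missing has to be computed from the
// master log instead.
inline bool can_assume_no_missing(const PeerInfo& peer,
                                  const Version& master_last_update) {
  return !peer.has_missing && peer.last_update == master_last_update;
}
```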
Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>