Sage Weil [Wed, 16 Mar 2011 21:29:53 +0000 (14:29 -0700)]
Makefile: drop libradosgw_a LDFLAGS
Fixes the warning
src/Makefile.am:299: variable `libradosgw_a_LDFLAGS' is defined but no program or
src/Makefile.am:299: library has `libradosgw_a' as canonic name (possible typo)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 16 Mar 2011 21:25:46 +0000 (14:25 -0700)]
mds: resync fragmentation during cache rejoin
During rejoin we may find that different MDSs have different fragmentation
for directories. When that happens we should refragment as needed on the
replicas to match what's on the primary.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 16 Mar 2011 05:18:45 +0000 (22:18 -0700)]
osd: only update last_epoch_started after all replicas commit peering results
The PG info.history.last_epoch_started is important because it bounds how
far back in time we think we need to look in order to fully recover the
contents of the PG. That's because every replica commits the PG peering
result (the info and pg log) when it activates.
In order for this to work properly, we can only advance last_epoch_started
_after_ the peer results are stable on disk on all replicas. Otherwise a
poorly timed failure (or set of failures) could lose the PG peer results
and we wouldn't go back far enough in time to find them.
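The gating described above can be sketched roughly as follows. This is a minimal illustration, not Ceph's actual peering code; all names (PeeringState, ack_commit, activation_epoch) are hypothetical:

```cpp
#include <set>

// Hypothetical sketch: the primary advances last_epoch_started only
// once every replica has acknowledged that the peering results
// (info + pg log) are stable on disk.
struct PeeringState {
  int last_epoch_started = 0;
  int activation_epoch = 0;      // epoch in which this interval activated
  std::set<int> replicas;        // osd ids participating in this interval
  std::set<int> committed;       // replicas that have committed peering results

  void ack_commit(int osd) {
    committed.insert(osd);
    if (committed == replicas)   // peer results stable everywhere
      last_epoch_started = activation_epoch;
  }
};
```

Until the last replica acks, last_epoch_started stays at its old value, so a recovery after an untimely failure still looks far enough back.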
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 20 Nov 2010 22:37:13 +0000 (14:37 -0800)]
filestore: adjust op_queue throttle max during fs commit
The underlying FS (btrfs at least) will block writes for a period while it
is doing a commit. If an OSD workload is write limited, we should raise
the op_queue max (operations that are queued to be applied to disk) during
the commit period.
For example, for a workload that is normally journal-throughput limited (writeahead mode):
- journal queue throttle normally limits things.
- sync starts
- journaled items move to the op_queue, which soon fills to its max
- all writes stop
- sync completes
- op_queue drains, new writes come in again
- journal queue throttle fills up, and again starts limiting throughput
For an fs throughput limited workload (writeahead):
- kernel buffer cache hits dirty limit
- op_queue throttle limits throughput
- sync starts
- opq stalls, new writes stall on throttler
- sync completes
- opq drains (quickly: kernel has no dirty pages)
- new writes flood in
- etc.
(Actually this isn't super realistic, because hitting the kernel dirty
limit will do all sorts of other weird things with userland memory
allocations.)
In both cases, the commit phase blocks up the op queue, and raising the
limit temporarily will keep things flowing. This should be ok because the
disks are still busy during this period; they're just flushing dirty
data and metadata. Once the sync completes the opq will quickly dump dirty
data into the kernel page cache and "catch up".
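The temporary raise can be sketched as below. This is an illustration of the idea only, with invented names (OpQueueThrottle, sync_start, commit_multiple), not the actual FileStore code:

```cpp
// Hypothetical sketch: while the underlying fs is committing, the
// op_queue admits more queued ops than usual so journaled items can
// keep draining; the limit returns to normal once the sync completes.
struct OpQueueThrottle {
  unsigned normal_max;       // steady-state op_queue depth limit
  unsigned commit_multiple;  // e.g. allow 2x the usual depth during commit
  bool committing;

  unsigned current_max() const {
    return committing ? normal_max * commit_multiple : normal_max;
  }
  void sync_start()  { committing = true; }
  void sync_finish() { committing = false; }
};
```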
Since cfuse usually runs as a nonprivileged user, its defaults must be a
little different from those of the other daemons. Add a flag to
common_init which can be used to set unprivileged daemon defaults.
SimpleMessenger::start() now just takes a boolean telling it whether to
daemonize. It no longer needs to check global variables or other arguments;
it daemonizes if told to, and otherwise does not.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Thu, 10 Mar 2011 23:37:33 +0000 (15:37 -0800)]
ReplicatedPG,OSD: Track which osds we are pulling from
Currently, a PG waiting on a pull from a dead OSD cannot continue
recovery. ReplicatedPG::pull now tracks open pulls by peer in
rec_from_peer (map<int, set<sobject_t> >).
OSD::advance_map now calls check_recovery_op_pulls to allow the PG to
reset pulls from failed peers.
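A simplified model of that bookkeeping is sketched below. The container shape (map<int, set<sobject_t>>) follows the commit message, but the class and member names here are illustrative, and sobject_t is stood in by a string:

```cpp
#include <map>
#include <set>
#include <string>

using sobject_t = std::string;  // stand-in for the real sobject_t

// Hypothetical sketch: track in-flight pulls keyed by source osd, so
// that when an osd dies the objects being pulled from it can be found
// and their recovery restarted from another source.
struct RecoveryPulls {
  std::map<int, std::set<sobject_t>> pulling_from;  // peer osd -> objects

  void start_pull(int peer, const sobject_t& soid) {
    pulling_from[peer].insert(soid);
  }

  // Called on osdmap advance for each failed peer: drop its pending
  // pulls and return the objects that must be re-pulled elsewhere.
  std::set<sobject_t> reset_pulls_from(int failed_peer) {
    std::set<sobject_t> out;
    auto it = pulling_from.find(failed_peer);
    if (it != pulling_from.end()) {
      out.swap(it->second);
      pulling_from.erase(it);
    }
    return out;
  }
};
```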
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
size_t is usually 32-bit on 32-bit architectures and 64-bit on 64-bit ones.
On the other hand, we want our offsets and lengths for librados and
librbd to be 64 bit everywhere. So we need to use uint64_t for offsets
and lengths.
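The failure mode being avoided can be shown directly; the helper name below is invented for illustration:

```cpp
#include <cstdint>

// Why size_t is unsuitable for librados/librbd offsets: on a 32-bit
// build, size_t is 32 bits wide, so an offset past 4 GiB silently
// wraps.  uint64_t guarantees 64-bit offsets on every architecture.
// simulate_32bit_size_t models storing an offset in a 32-bit size_t.
inline uint64_t simulate_32bit_size_t(uint64_t off) {
  return (uint32_t)off;  // high 32 bits are lost
}
```

For example, a 5 GiB offset stored this way comes back as 1 GiB.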
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 17:23:06 +0000 (09:23 -0800)]
osd: fix osdmap scanning on pg creation
On PG creation we were scanning the complete history of all osdmaps ever.
Fix initialization of PG::Info::History epoch_created and same_*_since
fields in the base (creation) case to make this work the way it was
supposed to.
Reported-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Tommi Virtanen [Fri, 11 Mar 2011 18:28:58 +0000 (10:28 -0800)]
auth: Allow using NSS as crypto library.
Added new configure flag --with-nss that enables this. NSS is also
automatically used if it is available and CryptoPP is not; use
--without-nss to explicitly forbid this.
No change on rgw crypto yet; rgw won't build without CryptoPP for now.
NSS initialization is in a static constructor for now. All it does is
set some values on in-memory data structures, so as long as no (other)
static constructor tries to use it, everything should just work. While
this could be moved to common_init, there are several other context
initialization steps with NSS, and a later refactoring to share the
results of these can just include NSS init as its first operation, at
practically no cost.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 20:45:16 +0000 (12:45 -0800)]
osd: wait for handle_osd_map transaction ondisk without doing a full sync
Doing a full sync (forcing a btrfs transaction etc) was just wrong here.
All we (might) care about is whether our ObjectStore::Transaction is
stable (in journal or fs) or not.
We are still waiting for those operations to flush to the fs (to be
readable). That may not be necessary either, but shouldn't have a big
performance impact.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 20:30:35 +0000 (12:30 -0800)]
osd: avoid setting up_thru on new PGs
This trades the possibility of peering blockage (if the OSDs in the
first interval after pg creation go down and stay down) for avoiding
two osdmap updates on every pg creation. I think this is reasonable given
that:
- If the pg did go active, then it did go RW, so assuming as much changes
nothing.
- If the pg did not go active, then it is empty, and no data is
lost.
To do this:
- When peering during the first ever interval, don't bother setting
up_thru.
- Always mark that interval maybe_went_rw, even though up_thru isn't set
as such.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>