Samuel Just [Thu, 10 Mar 2011 23:37:33 +0000 (15:37 -0800)]
ReplicatedPG,OSD: Track which osds we are pulling from
Currently, a PG waiting on a pull from a dead OSD cannot continue
recovery. ReplicatedPG::pull now tracks open pulls by peer in
rec_from_peer (map<int, set<sobject_t> >).
OSD::advance_map now calls check_recovery_op_pulls to allow the PG to
reset pulls from failed peers.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
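As an illustration of the per-peer bookkeeping described above, here is a minimal sketch. It is not the ReplicatedPG code: `PullTracker`, `object_id`, and all method names are hypothetical, and a plain string stands in for `sobject_t`. Only the shape of `rec_from_peer` is taken from the commit message.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Assumption: a plain string stands in for Ceph's sobject_t.
using object_id = std::string;

// Hypothetical tracker: open pulls, keyed by the id of the OSD we pull from.
struct PullTracker {
  std::map<int, std::set<object_id>> rec_from_peer;

  // Record that we asked `peer` for `oid`.
  void start_pull(int peer, const object_id& oid) {
    rec_from_peer[peer].insert(oid);
  }

  // Record that the pull of `oid` from `peer` completed.
  void finish_pull(int peer, const object_id& oid) {
    auto it = rec_from_peer.find(peer);
    if (it == rec_from_peer.end()) return;
    it->second.erase(oid);
    if (it->second.empty()) rec_from_peer.erase(it);
  }

  // On a map change, forget pulls from peers that are no longer up and
  // return the affected objects so the caller can pull them elsewhere.
  std::vector<object_id> reset_pulls_from_down_peers(const std::set<int>& up_osds) {
    std::vector<object_id> to_retry;
    for (auto it = rec_from_peer.begin(); it != rec_from_peer.end();) {
      if (up_osds.count(it->first) == 0) {
        to_retry.insert(to_retry.end(), it->second.begin(), it->second.end());
        it = rec_from_peer.erase(it);
      } else {
        ++it;
      }
    }
    return to_retry;
  }
};
```

Keying the map by peer makes the "which pulls die with this OSD" lookup a single map access, which is presumably why the open pulls are indexed that way.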
size_t is usually 32 bits wide on 32-bit architectures and 64 bits wide
on 64-bit ones. On the other hand, we want our offsets and lengths for
librados and librbd to be 64-bit everywhere. So we need to use uint64_t
for offsets and lengths.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
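A small sketch of why the width matters. `Extent`, `as_32bit_size`, and `end_of` are hypothetical names for illustration, not part of the librados or librbd API; the 32-bit case is simulated with `std::uint32_t` so the example behaves the same on any build host.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// uint64_t has exactly 64 value bits wherever it exists.
static_assert(std::numeric_limits<std::uint64_t>::digits == 64,
              "uint64_t is 64 bits wide");

// Hypothetical extent type using fixed-width fields.
struct Extent {
  std::uint64_t offset;
  std::uint64_t length;
};

// Simulates what happens when an offset is squeezed through a 32-bit size_t:
// the high 32 bits are silently dropped.
inline std::uint32_t as_32bit_size(std::uint64_t v) {
  return static_cast<std::uint32_t>(v);
}

// End offset computed entirely in 64-bit arithmetic.
inline std::uint64_t end_of(const Extent& e) {
  return e.offset + e.length;
}
```

With a 5 GiB offset, `as_32bit_size` yields 1 GiB, while the `Extent` fields keep the full value.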
Sage Weil [Fri, 11 Mar 2011 17:23:06 +0000 (09:23 -0800)]
osd: fix osdmap scanning on pg creation
On PG creation we were scanning the complete history of all osdmaps ever.
Fix initialization of PG::Info::History epoch_created and same_*_since
fields in the base (creation) case to make this work the way it was
supposed to.
Reported-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Tommi Virtanen [Fri, 11 Mar 2011 18:28:58 +0000 (10:28 -0800)]
auth: Allow using NSS as crypto library.
Added new configure flag --with-nss that enables this. NSS is also
automatically used if it is available and CryptoPP is not; use
--without-nss to explicitly forbid this.
No change on rgw crypto yet; rgw won't build without CryptoPP for now.
NSS initialization is in a static constructor for now. All it does is
set some values on in-memory data structures, so as long as no (other)
static constructor tries to use it, everything should just work. While
this could be moved to common_init, there are several other context
initialization steps with NSS, and a later refactoring to share the
results of these can just include NSS init as its first operation, at
practically no cost.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 20:45:16 +0000 (12:45 -0800)]
osd: wait for handle_osd_map transaction ondisk without doing a full sync
Doing a full sync (forcing a btrfs transaction etc) was just wrong here.
All we (might) care about is whether our ObjectStore::Transaction is
stable (in journal or fs) or not.
We are still waiting for those operations to flush to the fs (to be
readable). That may not be necessary either, but shouldn't have a big
performance impact.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 11 Mar 2011 20:30:35 +0000 (12:30 -0800)]
osd: avoid setting up_thru on new PGs
This accepts a possible peering blockage if the OSDs in the first
interval (after pg creation) go down and stay down, in exchange for
avoiding two osdmap updates for every pg creation. I think this is
reasonable given that:
- If the pg did go active, then it did go RW, so assuming as much changes
nothing.
- If the pg did not go active, then it is empty, and there is no data
lost.
To do this:
- When peering during the first ever interval, don't bother setting
up_thru.
- Always mark that interval maybe_went_rw, even though up_thru isn't set
as such.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Tommi Virtanen [Fri, 11 Mar 2011 00:37:11 +0000 (16:37 -0800)]
clitest: Fix tests after osdmaptool --clobber bugfix.
Commit 5c8146b55dbd60bdfa47b53b93f2769f7d0524dc fixed clobber; adjust
the clitests to match. Reordered the test logic so that the "fsid
does not change" check covers a run without --clobber.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Greg Farnum [Thu, 10 Mar 2011 22:06:16 +0000 (14:06 -0800)]
rados: Add "stat" option, and fix "put" to work on larger block sizes.
We didn't have a stat option, now we do.
Previously, "put" allocated its read space on the stack. That meant
the max block size was a little under 8MB, or you got a segfault! Now
it's on the heap, and you can set it as you like.
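For illustration, a read buffer whose size is chosen at run time can be kept off the stack as sketched below. This is not the rados tool's code: `read_block` is a hypothetical helper, and `std::vector` is just one way to get heap-backed storage.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read up to `block_size` bytes from `in` into a
// heap-backed buffer and return only the bytes actually read.
std::vector<char> read_block(std::istream& in, std::size_t block_size) {
  // std::vector stores its elements on the heap, so a large block_size
  // does not consume stack space.
  std::vector<char> buf(block_size);
  in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
  // gcount() reports how many bytes the last read() extracted.
  buf.resize(static_cast<std::size_t>(in.gcount()));
  return buf;
}
```

A fixed `char buf[N]` local would be limited by the thread's stack size, which is commonly 8MB on Linux; the heap-backed buffer is bounded only by available memory.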
Sage Weil [Thu, 10 Mar 2011 21:06:08 +0000 (13:06 -0800)]
filestore: assert on ENOTEMPTY
ENOTEMPTY implies rmdir failed due to stray crap in the directory. We
should fail now, instead of later when we restart cosd and have stray
data sitting around.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 19:20:54 +0000 (11:20 -0800)]
filer: set RMW bit on probe
Setting the RMW bits on the probe stat call will make the OSD wait for
pending writes on the object to flush to disk. This was a problem for MDS
takeover: the old instance had writes mid-commit on the OSD when the
journal probe/stat came in, and it didn't "see" the size change because
the PG was in 'delayed' mode and the old MDS's write hadn't been queued
for disk locally. Simply claiming we are a RMW op will force the PG
into rmw mode so that we know the prior write will be visible. This is
just fine for the metadata workload.
Fixes: #805
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 10 Mar 2011 18:17:24 +0000 (10:17 -0800)]
osd: fix peer no missing optimization
This shortcut was broken: we need to populate peer_missing with missing
objects in terms of the master log, not the peer's log (which may be old
or even divergent). This shortcut _only_ makes sense when the peer has
no missing in terms of a log that is perfectly up to date (i.e. matches
our last_update).
Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
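The condition under which the shortcut is valid can be written as a predicate. This is only a sketch of the rule stated above: `PeerSummary` and `can_use_no_missing_shortcut` are hypothetical, and a bare integer stands in for Ceph's version type.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a single integer stands in for the log version type.
using version_t = std::uint64_t;

// Hypothetical summary of what a peer reported about itself.
struct PeerSummary {
  version_t last_update;  // newest entry in the peer's log
  bool has_missing;       // peer reports missing objects per its own log
};

// The peer's "nothing missing" claim can be trusted only if its log is
// exactly as new as ours; otherwise its missing set must be computed
// against the master log instead.
inline bool can_use_no_missing_shortcut(const PeerSummary& peer,
                                        version_t our_last_update) {
  return !peer.has_missing && peer.last_update == our_last_update;
}
```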
Sage Weil [Thu, 10 Mar 2011 17:54:53 +0000 (09:54 -0800)]
osd: fix missing.rm()
The version specifies which version of the object no longer should be
missing. We should thus remove it from the missing set if we needed
anything less than OR EQUAL to that version. (If we are missing something
newer, then missing.rm() is a no-op.)
Make the argument name less weird while we're at it.
Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
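The less-than-or-equal rule described above can be sketched as follows. `MissingSet` here is a hypothetical, simplified stand-in (object name mapped to the version still needed, with an integer in place of Ceph's version type), not the OSD's actual structure.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical simplified missing set: object name -> version we still need.
struct MissingSet {
  std::map<std::string, std::uint64_t> need;

  void add(const std::string& oid, std::uint64_t v) { need[oid] = v; }

  // `v` is the version of `oid` that is no longer missing. Remove the entry
  // if the version we needed is <= v; if we need something newer, this is
  // a no-op.
  void rm(const std::string& oid, std::uint64_t v) {
    auto it = need.find(oid);
    if (it != need.end() && it->second <= v) need.erase(it);
  }

  bool is_missing(const std::string& oid) const {
    return need.count(oid) != 0;
  }
};
```

Using a strict less-than in `rm` would leave an object marked missing even after exactly the needed version arrived, which is the case the equality covers.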
Sage Weil [Thu, 10 Mar 2011 23:12:22 +0000 (15:12 -0800)]
mkcephfs: modularize
The goal is to support the old "ssh to everything" mode and also a
piecewise mode that lets the administrator do each step and handle
data copying and remote execution themselves.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Tue, 8 Mar 2011 17:36:52 +0000 (09:36 -0800)]
uclient: Clear the CEPH_CAP_FILE_BUFFER ref on _flush, if safe.
Previously we just returned if safe, but leaving the CEPH_CAP_FILE_BUFFER
ref around breaks _fsync horribly. The root cause of this is
update_inode_file_bits calling objectcacher->truncate_set without
clearing the BUFFER ref, but the mechanics of clearing it there are
complicated, and I don't believe there are any issues with keeping
around the extra reference, as long as it's cleared when necessary.