Sage Weil [Thu, 23 Jun 2011 21:26:34 +0000 (14:26 -0700)]
osd: instrument readable latency too
Time before a write is readable (not necessarily on disk). Note that if we
get the commit first (e.g. writeahead journal) this value isn't calculated
or logged.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Tommi Virtanen [Wed, 22 Jun 2011 22:23:14 +0000 (15:23 -0700)]
run-cli-tests: Pass through CCACHE_DIR and such env vars.
Commit 7cd50f29d5cbf8deb64d00318b39c281119c0e03 makes the binaries
use libtool's "executable wrappers", which will transparently relink
the executables if they think that's needed. The test for that is
somewhat flawed: if the mtimes match, the binary might or might not
get relinked. In practice, this causes relinks on gitbuilder all the
time.
As of earlier commit 5a0bc6b78f2e40ec9255a1ea49f77ef9ea4690a6, we
started sanitizing the environment passed to the clitests. This meant
we also stripped away CCACHE_DIR and other settings needed to
properly relink the binaries. Re-add CCACHE_DIR, CC, CXX to clitests
environment.
To handle the case where CCACHE_DIR etc are not set in the first
place, we need an extra wrapper script. Otherwise, ccache might see an
empty string as the env value, and naturally couldn't access a
directory by that name.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
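A minimal sketch of the pass-through idea described above, assuming a
hypothetical helper named sanitized_env (not the actual run-cli-tests
code, which uses a wrapper script). It forwards a variable only when it
is set in the parent environment, so the child never sees an empty
string in place of an unset value.

```python
# Variable names taken from the commit message; the helper itself is hypothetical.
PASSTHROUGH = ("CCACHE_DIR", "CC", "CXX")


def sanitized_env(parent, base=None):
    """Return a clean environment, forwarding only variables that are set."""
    env = dict(base or {})
    for name in PASSTHROUGH:
        if name in parent:  # unset variables are skipped, never set to ""
            env[name] = parent[name]
    return env


# HOME is dropped, CCACHE_DIR is forwarded, CC and CXX stay absent.
env = sanitized_env(
    {"CCACHE_DIR": "/tmp/ccache", "HOME": "/root"},
    base={"PATH": "/usr/bin"},
)
```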
Sage Weil [Wed, 22 Jun 2011 19:45:35 +0000 (12:45 -0700)]
client: always use get_snap_context() accessor
There were a few places where we were using the member directly and not the
accessor, which rebuilds the internal value when needed. This caused
inconsistent behavior based on whether debugging was enabled or not, since
we used the accessor to print the regenerated value.
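To illustrate the class of bug, here is a small sketch in which a cached
value is rebuilt only by its accessor. The SnapRealm class and its fields
are hypothetical stand-ins; only the accessor name get_snap_context comes
from the commit.

```python
class SnapRealm:
    """Hypothetical holder of a lazily rebuilt snap context."""

    def __init__(self, snaps):
        self.snaps = list(snaps)
        self._cached_context = None  # invalid until the accessor rebuilds it

    def add_snap(self, snap_id):
        self.snaps.append(snap_id)
        self._cached_context = None  # invalidate the cached value

    def get_snap_context(self):
        # Rebuild on demand; callers must go through this accessor.
        if self._cached_context is None:
            seq = max(self.snaps, default=0)
            self._cached_context = (seq, sorted(self.snaps, reverse=True))
        return self._cached_context


realm = SnapRealm([1, 2])
realm.add_snap(5)
direct = realm._cached_context          # stale: reading the member directly
via_accessor = realm.get_snap_context() # regenerated value
```

Any code path that happens to call the accessor first (a debug print, for
example) changes what a later direct read of the member sees, which is the
inconsistency the commit describes.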
Greg Farnum [Tue, 21 Jun 2011 00:55:54 +0000 (17:55 -0700)]
AnchorServer: overwrite old Anchor backpointers when proper.
Sometimes when we do an AnchorTable update, it's because the inode
in question got moved. However, if the inode had multiple references,
its Anchor wasn't removed by decrementing its count, and so the add
function simply noted that it already had the Anchor and returned.
This obviously wasn't the proper behavior in cases where the inode
was getting moved -- we want to update its back pointer! So do so.
Greg Farnum [Mon, 20 Jun 2011 19:17:24 +0000 (12:17 -0700)]
uclient: Update statfs to match the kernel client and its block sizing.
Make it better match the kernel client, and its scheme to use a large
block size so we don't overflow 32-bit systems. This isn't presently
a serious concern since FUSE doesn't work on 32-bit systems anyway,
but the output of df should match even so.
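The overflow concern can be checked with a little arithmetic. The sketch
below assumes a 1 MiB block size (a shift of 20) purely for illustration;
the exact value used by the kernel client is not stated in this commit.

```python
ASSUMED_BLOCK_SHIFT = 20  # assumption: 1 MiB blocks, for illustration only


def statfs_blocks(total_kb, block_shift=ASSUMED_BLOCK_SHIFT):
    """Convert a size in KiB into a count of 2**block_shift-byte blocks."""
    return total_kb >> (block_shift - 10)


def fits_32bit(n):
    return n < 2**32


total_kb = 100 * 1024**3                   # 100 TiB expressed in KiB
blocks_4k = statfs_blocks(total_kb, 12)    # 4 KiB blocks
blocks_1m = statfs_blocks(total_kb, 20)    # 1 MiB blocks
```

With 4 KiB blocks the block count for a 100 TiB cluster no longer fits in
an unsigned 32-bit field, while with 1 MiB blocks it does.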
Greg Farnum [Fri, 17 Jun 2011 18:52:13 +0000 (11:52 -0700)]
uclient: fix flush_caps(Inode*,int)
This function was just broken before. You need to be setting
flushing_cap_tids for the caps you're actually flushing, which
in this case is in->dirty_caps, not the horribly-named "flush"
variable.
Also, rename "flush" to "retain" since that's what it actually is!
Greg Farnum [Fri, 17 Jun 2011 18:08:26 +0000 (11:08 -0700)]
uclient: only change the auth_cap if the mseq is newer
Previously we just trusted the MDS' reported auth status, which meant
that even if the MDS was always right it could encode the stat while
auth, then export, and then have us decode the stat after getting
an IMPORT and set the auth cap right back!
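A sketch of the guard, under the assumption that each cap carries a
migration sequence number (mseq) that increases with every export/import.
The Cap and Inode classes and the function name are hypothetical.

```python
class Cap:
    def __init__(self, mds, mseq):
        self.mds = mds
        self.mseq = mseq


class Inode:
    def __init__(self):
        self.auth_cap = None


def maybe_set_auth_cap(inode, cap, reported_auth):
    """Accept a reported auth cap only if its mseq is newer than the current one."""
    if not reported_auth:
        return False
    current = inode.auth_cap
    if current is None or cap.mseq > current.mseq:
        inode.auth_cap = cap
        return True
    return False


inode = Inode()
# The IMPORT from mds1 arrives first and carries the newer mseq.
imported = maybe_set_auth_cap(inode, Cap(mds=1, mseq=2), reported_auth=True)
# A stat encoded by mds0 before the export is decoded later; it must not win.
stale = maybe_set_auth_cap(inode, Cap(mds=0, mseq=1), reported_auth=True)
```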
Ensure that the signal waits for us to call sigsuspend, rather than
getting delivered to another thread. We do this by using SIGPIPE as our
signal of choice. SIGPIPE is blocked in all Ceph threads by default.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Wed, 15 Jun 2011 23:57:02 +0000 (16:57 -0700)]
ReplicatedPG: clean up OpContext use in do_op
do_op, make_writeable, and friends now update new_obs, new_snapset, and
new_stats rather than updating the underlying structures directly, so
that the originals are left untouched if the transaction fails to apply.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
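The staging pattern can be sketched as follows. This is an illustration
only: the OpContext here is a toy with a single new_obs field, and the
mutate and apply_transaction callbacks are hypothetical.

```python
import copy


class OpContext:
    """Toy context: obs is the committed state, new_obs a staged copy."""

    def __init__(self, obs):
        self.obs = obs
        self.new_obs = copy.deepcopy(obs)


def do_op(ctx, mutate, apply_transaction):
    mutate(ctx.new_obs)          # all changes go to the staged copy
    if not apply_transaction():
        return False             # committed state was never touched
    ctx.obs.clear()
    ctx.obs.update(ctx.new_obs)  # publish only after a successful apply
    return True


def grow(obs):
    obs["size"] += 4


state = {"size": 4}
failed = do_op(OpContext(state), grow, lambda: False)
size_after_failure = state["size"]
ok = do_op(OpContext(state), grow, lambda: True)
```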
Samuel Just [Tue, 14 Jun 2011 22:29:44 +0000 (15:29 -0700)]
ReplicatedPG: Replica collection removal
Previously, snap collections and their contents on the replica would be
removed by the primary via a shipped transaction. However, these
operations are not represented in the log and are therefore not
reconstructed during recovery. Now, once recovery has advanced
adequately (last_complete_ondisk.epoch >=
info.history.last_epoch_started), the replica will take each collection
from purged snaps, remove the objects, and then remove the collection.
Once the snap has made it into purged_snaps, it is guaranteed that no
further operations will require that snap collection, so this is safe.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
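A rough sketch of the replica-side purge, assuming a hypothetical store
that maps each snap id to the set of objects in its collection. Only the
epoch comparison and the order of operations come from the commit.

```python
def purge_snap_collections(store, purged_snaps, last_complete_epoch,
                           last_epoch_started):
    """Remove purged snap collections once recovery has advanced far enough."""
    if last_complete_epoch < last_epoch_started:
        return []                  # too early: recovery may still need them
    removed = []
    for snap in sorted(purged_snaps):
        if snap in store:
            store[snap].clear()    # remove the objects first
            del store[snap]        # then remove the now-empty collection
            removed.append(snap)
    return removed


store = {3: {"obj_a", "obj_b"}, 7: {"obj_c"}}
too_early = purge_snap_collections(store, {3}, last_complete_epoch=9,
                                   last_epoch_started=10)
removed = purge_snap_collections(store, {3}, last_complete_epoch=10,
                                 last_epoch_started=10)
```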
Samuel Just [Fri, 20 May 2011 21:56:51 +0000 (14:56 -0700)]
ReplicatedPG,PG: SnapTrimmer state machine skeleton
Currently, snap_trimmer does not wait for replicas to apply the object
removal repops before updating the info with the removed snapshot and
sending out infos. Thus, there is a race between the replica applying
the object removal transactions and receiving the info prompting it to
remove its local snap collection. In the cases where the info beats the
transaction to the file system, we get the infamous -39 ENOTEMPTY error
crashing the OSD.
One solution would be to block in snap_trimmer until we get the
responses. This would, unfortunately, tie up a disk_tp thread. Moving
to a state machine would:
a) guarantee that the snap_trimmer state is cleaned up when the pg
resets (using on_change())
b) remove the necessity of blocking in a disk_tp thread (we currently
unlock and relock the pg lock between operations, causing us to block on
another thread using the pg). Rather than unlocking and relocking, we
would instead unlock and requeue ourselves in the snap_trim_wq while we
wait, allowing the thread to do something else.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
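The requeue-instead-of-block idea can be sketched with a tiny state
machine. The state names, the SnapTrimmer methods other than on_change,
and the deque standing in for snap_trim_wq are all hypothetical; this is
not the skeleton added by the commit.

```python
from collections import deque


class SnapTrimmer:
    def __init__(self, replicas):
        self.replicas = set(replicas)
        self.state = "NotTrimming"
        self.pending = set()

    def start(self):
        # Removal repops have been sent; wait for every replica to reply.
        self.pending = set(self.replicas)
        self.state = "WaitingOnReplicas"

    def on_reply(self, replica):
        self.pending.discard(replica)

    def on_change(self):
        # PG reset: drop all trimming state.
        self.pending.clear()
        self.state = "NotTrimming"

    def step(self):
        """Run one non-blocking step; return True if we must be requeued."""
        if self.state == "WaitingOnReplicas" and not self.pending:
            self.state = "NotTrimming"
        return self.state == "WaitingOnReplicas"


def run_once(queue):
    trimmer = queue.popleft()
    if trimmer.step():
        queue.append(trimmer)  # requeue rather than block the worker thread


wq = deque()
t = SnapTrimmer(["osd1", "osd2"])
t.start()
wq.append(t)
run_once(wq)                   # still waiting, so it goes back on the queue
requeued = len(wq)
t.on_reply("osd1")
t.on_reply("osd2")
run_once(wq)                   # all replies in, trimming step completes
```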
Samuel Just [Wed, 8 Jun 2011 18:23:56 +0000 (11:23 -0700)]
ReplicatedPG,PG: update snap_collections on replica
Previously, snap_collections did not get updated on the replica. As a
result, snap collections would not necessarily get trimmed when the
replica received and updated purged_snaps via a pg info from the
primary. Now, the log entries in sub_op_modified are scanned to check
for any new snap collections. The first and last snaps on a clone entry
are possibly new snap collections.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
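A sketch of the scan, assuming a hypothetical log entry shape (a dict with
an "op" field and a list of "snaps"). Only the rule that the first and
last snaps of a clone entry are candidates comes from the commit.

```python
def new_snap_collections(log_entries, known):
    """Return snap ids from clone entries that are not yet tracked."""
    found = set()
    for entry in log_entries:
        snaps = entry.get("snaps")
        if entry.get("op") != "clone" or not snaps:
            continue
        # Only the first and last snaps on a clone can name new collections.
        for snap in (snaps[0], snaps[-1]):
            if snap not in known:
                found.add(snap)
    return found


entries = [
    {"op": "modify", "snaps": []},
    {"op": "clone", "snaps": [9, 8, 6]},
    {"op": "clone", "snaps": [4]},
]
snap_collections = {4, 9}
fresh = new_snap_collections(entries, snap_collections)
snap_collections |= fresh
```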
Use libtool convenience libraries rather than explicitly forcing .a
files (static code archives) to be generated or including library .c
files directly into applications.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Tue, 21 Jun 2011 19:47:32 +0000 (12:47 -0700)]
msgr: only SO_REUSEADDR when specific port is specified
In general, SO_REUSEADDR is slightly dangerous, but it avoids waiting for the
timeout when restarting servers. This is important when binding to a
specific port.
When binding to a random port, it doesn't matter. Also, it appears that
two processes can bind() to the same port with that flag set, and then one
will fail with EADDRINUSE on listen(). That's racy when starting up
daemons that should be binding to unique/random ports.
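A minimal sketch of the policy using the standard socket API. The function
names are hypothetical, and port 0 is used here to mean "let the kernel
pick a port".

```python
import socket


def make_socket(port):
    """Create a TCP socket, enabling SO_REUSEADDR only for a specific port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if port != 0:
        # A fixed port must be rebindable right after a restart.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


def open_listener(port, host="127.0.0.1"):
    sock = make_socket(port)
    sock.bind((host, port))  # port 0: the kernel assigns a free port
    sock.listen(16)
    return sock
```

Leaving the flag off for kernel-assigned ports means two daemons cannot
both succeed at bind() on the same port and then race at listen().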
We need to initialize the last_decay time of the DecayCounter when
decoding it. This is not found in the encoded information, but instead
is set to the current time. We need to pass this in explicitly now
because of deglobalization.
This also reduces the number of calls to gettimeofday, which is good in
general.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
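The decode change can be sketched like this. The encoded fields (val and
half_life) and the method signatures are assumptions for illustration,
not the real DecayCounter layout; the point is that last_decay is never
serialized and the caller supplies the current time once.

```python
import struct


class DecayCounter:
    def __init__(self, now, half_life=5.0):
        self.val = 0.0
        self.half_life = half_life
        self.last_decay = now  # not part of the encoded form

    def encode(self):
        return struct.pack("<dd", self.val, self.half_life)

    @classmethod
    def decode(cls, data, now):
        # The caller passes "now" explicitly instead of the decoder
        # reading a global clock.
        val, half_life = struct.unpack("<dd", data)
        counter = cls(now, half_life)
        counter.val = val
        return counter


original = DecayCounter(now=100.0)
original.val = 3.0
restored = DecayCounter.decode(original.encode(), now=250.0)
```

A caller decoding many counters can read the clock once and pass the same
value to each decode call.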