Currently, for pools with different rules, "ceph df" cannot report the
correct available space for each of them. For a detailed assessment
of the bug, please refer to bug report #8943.
This patch fixes the bug and makes "ceph df" work correctly.
The dirty_or_tx list is used by flush_set, which means we can
resubmit new IOs for writes that are already in progress. This
has a compounding effect that overwhelms the OSDs with dup IOs
and stalls out the client.
See, for example, the failures in this run:
/a/sage-2014-07-25_17:14:20-fs-wip-msgr-testing-basic-plana
The fix is probably pretty simple, but reverting for now to make
the tests pass.
Sage Weil [Fri, 25 Jul 2014 20:17:32 +0000 (13:17 -0700)]
common/RefCountedObject: fix use-after-free in debug print
We could race with another thread that deletes this right after we call
dec(). Our access of cct would then become a use-after-free. Valgrind
managed to turn this up.
Copy it into a local variable before the dec() to be safe, and move the
dout line below to make this possibility explicit and obvious in the code.
Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.
Sage Weil [Thu, 24 Jul 2014 01:25:53 +0000 (18:25 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone()
We cannot assume that just because cache_mode is NONE we will have
all clones present; check for the absence of the INCOMPLETE_CLONES flag
here too.
Sage Weil [Thu, 24 Jul 2014 01:24:51 +0000 (18:24 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES when doing clone subsets
During recovery, we can clone subsets if we know that all clones will be
present. We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.
Sage Weil [Thu, 24 Jul 2014 01:23:56 +0000 (18:23 -0700)]
osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set. Both are
indicators that we may be missing clones and that this is okay.
Fixes: #8882 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Thu, 24 Jul 2014 01:21:38 +0000 (18:21 -0700)]
osd/osd_types: add pg_pool_t FLAG_INCOMPLETE_CLONES
Set a flag on the pg_pool_t when we change the cache_mode from NONE. This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain. Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.
Sage Weil [Sat, 19 Jul 2014 06:16:09 +0000 (23:16 -0700)]
os/LFNIndex: only consider alt xattr if nlink > 1
If we are doing a lookup and the main xattr fails to match, we'll check
whether there is an alt xattr. If it exists, but the nlink on the inode is
only 1, we will kill the xattr. This cleans up the mess left over by an
incomplete lfn_unlink operation.
This resolves the problem with an lfn_link to a second long name that
hashes to the same short_name: we will ignore the old name the moment the
old link goes away.
Fixes: #8701 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Sat, 19 Jul 2014 00:28:18 +0000 (17:28 -0700)]
os/LFNIndex: remove alt xattr after unlink
After we unlink, if the nlink on the inode is still non-zero, remove the
alt xattr. We can *only* do this after the rename or unlink operation
because we don't want to leave a file system link in place without the
matching xattr; hence the fsync_dir() call.
Note that this might leak an alt xattr if we happen to fail after the
rename/unlink but before the removexattr is committed. We'll fix that
next.
Sage Weil [Sat, 19 Jul 2014 00:09:07 +0000 (17:09 -0700)]
os/LFNIndex: handle long object names with multiple links (i.e., rename)
When we rename an object (collection_move_rename) to a different name, and
the name is long, we run into problems because the lfn xattr can only track
a single long name linking to the inode. For example, suppose we have
foobar -> foo_123_0 (attr: foobar) where foobar hashes to 123.
At first, collection_add could only link a file to another file in a
different collection with the same name. Allowing collection_move_rename
to rename the file, however, means that we have to convert:
col1/foobar -> foo_123_0 (attr: foobar)
to
col1/foobaz -> foo_234_0 (attr: foobaz)
This is a problem because if we link, reset xattr, unlink we end up with
col1/foobar -> foo_123_0 (attr: foobaz)
if we restart after we reset the attr. This will cause the initial foobar
lookup to fail since the attr doesn't match, and the file won't be able to be
looked up.
Fix this by allowing *two* (long) names to link to the same inode. If we
lfn_link a second (different) name, move the previous name to the "alt"
xattr and set the new name. (This works because link is always followed
by unlink.) On lookup, check either xattr.
Don't even bother to remove the alt xattr on unlink. This works as long
as the old name and new name don't hash to the same shortname and end up
in the same LFN chain. (Don't worry, we'll fix that next.)
Fixes part of #8701 Signed-off-by: Sage Weil <sage@redhat.com>
Dan Mick [Thu, 3 Jul 2014 23:08:44 +0000 (16:08 -0700)]
Fix/add missing dependencies:
- rbd-fuse depends on librados2/librbd1
- ceph-devel depends on specific releases of libs and libcephfs_jni1
- librbd1 depends on librados2
- python-ceph does not depend on libcephfs1
Sage Weil [Tue, 22 Jul 2014 20:16:11 +0000 (13:16 -0700)]
osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e., we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation. We don't care about starvation here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.
Fixes: #8889 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Tue, 22 Jul 2014 20:11:42 +0000 (13:11 -0700)]
osd: allow greedy get_write() for ObjectContext locks
There are several lockers that need to take a write lock
because an operation is already in progress, and they know it
is safe to do so. In particular, they need to skip
the starvation checks (op waiters, backfill waiting).
Sage Weil [Tue, 22 Jul 2014 20:37:20 +0000 (13:37 -0700)]
os/KeyValueStore: make get_max_object_name_length() sane
This is getting the NAME_MAX from the OS, but in reality the backend
KV store is the limiter. And for leveldb, there is no real limit.
Return 4096 for now.
mon: AuthMonitor: always encode full regardless of keyserver having keys
On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers. A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.
As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals. This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk. This is performed in update_from_paxos(), and the
AuthMonitor will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.
Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first. If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals and keeps track
of auth global ids. As such, we expect to read version 1, then version 2,
and so on. If we trim at some point however this will not be possible,
as version 1 will not exist -- and we will assert because of that.
This is fixed by ensuring the AuthMonitor keeps track of full versions
of the key server, even if it's of an empty key server -- it will still
keep track of the key server's version, which is incremented each time
we update from paxos even if it is empty.
Fixes: #8851
Backport: dumpling, firefly
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
osd: init local_connection for fast_dispatch in _send_boot()
We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.
That led to errors like the following:
When doing an EC read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)
ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
5: (()+0x8182) [0x7f7665670182]
6: (clone()+0x6d) [0x7f7663a1130d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9061988ec7eaa922e2b303d9eece86e7c8ee0fa1)
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
librbd calculates the max dirty object count from rbd_cache_max_size, but
this isn't suitable for every case. If the user sets an image order of 24,
the calculated result is too small in practice, which increases the overhead
of the trim call that is made on each read/write op.
Now we make this a tunable option; by default the value is still calculated.
Haomai Wang [Mon, 14 Jul 2014 06:32:57 +0000 (14:32 +0800)]
Reduce ObjectCacher flush overhead
A flush op in ObjectCacher iterates over the whole active object set, and
each dirty object may own several BufferHeads. If the object set is large,
this consumes too much time.
Use dirty_bh instead to reduce the overhead: now only dirty BufferHeads
will be checked.
Sage Weil [Sun, 20 Jul 2014 14:48:47 +0000 (07:48 -0700)]
os/FileStore: fix max object name limit
Our max object name is not limited by file name size, but by the length of
the name we can stuff in an xattr. That will vary from file system to
file system, so just make this 4096. In practice, it should be limited
via the global tunable, if it is adjusted at all.
rgw: account common prefixes for MaxKeys in bucket listing
This makes us more in line with the S3 API. Previously we did not count
common prefixes towards MaxKeys (a single common prefix counts as a
single key). We also need to adjust the marker now if it is pointing at a
common prefix.
Only decode '+' characters to spaces if we're in a query argument. The
'+' => ' ' translation is not correct for file/directory names.
Resolves http://tracker.ceph.com/issues/8702
Reviewed-by: Yehuda Sadeh <yehuda@redhat.com> Signed-off-by: Brian Rak <dn@devicenull.org>
If we found a prefix, calculate a string greater than it so that on the
next request we can skip ahead to it. This is still not the most efficient
way to do it; it would be better to push it down to the objclass, but that
would require a much bigger change.
Sage Weil [Fri, 18 Jul 2014 17:42:11 +0000 (10:42 -0700)]
os: add ObjectStore::get_max_attr_name_length()
Most importantly, capture that attrs on FileStore can't be more than about
100 chars. Linux xattrs can only be 128 chars, but we also have some
prefixing of our own.
Sage Weil [Wed, 16 Jul 2014 21:17:27 +0000 (14:17 -0700)]
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)
Previously we had a hard-coded limit of 4096. Object names longer than
about 3k crash the OSD when running on ext4, although they probably work
on xfs. But rgw only generates object names a bit over 1024 bytes (maybe
1200 tops?), so let's set a more reasonable limit here. 2048 is a nice
round number and should be safe.
Add a test.
Fixes: #8174 Signed-off-by: Sage Weil <sage@redhat.com>
John Spray [Thu, 17 Jul 2014 12:15:45 +0000 (13:15 +0100)]
mds: fix journal reformat failure in standbyreplay
In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, whereas this should only have been done by the active
MDS for the rank. Depending on timing, this could cause
fatal corruption of the journal.
This change handles the following cases:
* only do the reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide
Fixes: #8811 Signed-off-by: John Spray <john.spray@redhat.com>
John Spray [Thu, 17 Jul 2014 12:15:10 +0000 (13:15 +0100)]
osdc/Journaler: validate header on load and save
Previously if the journal header contained invalid
write, expire or trimmed offsets, we would end up
hitting a hard-to-understand assertion much later.
Instead, raise the error right away if the fields
are identifiably bad at load time, and assert that
they're valid before persisting them.