Dan Mick [Fri, 23 Aug 2013 00:30:24 +0000 (17:30 -0700)]
ceph_rest_api.py: create own default for log_file
common/config thinks the default log_file for non-daemons should be "".
Override that so that the default is
/var/log/ceph/{cluster}-{name}.{pid}.log
since ceph-rest-api is more of a daemon than a client.
Fixes: #6099
Backport: dumpling
Signed-off-by: Dan Mick <dan.mick@inktank.com>
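A minimal sketch of the override described in this commit, not the actual code in ceph_rest_api.py. The helper names `default_log_file` and `effective_log_file` are hypothetical, and expanding `{cluster}`, `{name}` and `{pid}` with `str.format` is an assumption made purely for illustration.

```python
import os

# Daemon-style default taken from the commit message above.
DAEMON_LOG_TEMPLATE = "/var/log/ceph/{cluster}-{name}.{pid}.log"


def default_log_file(cluster, name, pid=None):
    """Expand the daemon-style template (hypothetical helper)."""
    if pid is None:
        pid = os.getpid()
    return DAEMON_LOG_TEMPLATE.format(cluster=cluster, name=name, pid=pid)


def effective_log_file(configured, cluster, name, pid=None):
    """Keep an explicit log_file setting; replace the empty
    non-daemon default ("") with the daemon-style path."""
    if configured:
        return configured
    return default_log_file(cluster, name, pid)
```

An explicitly configured log_file still wins; only the empty default is replaced.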
David Disseldorp [Mon, 29 Jul 2013 15:05:44 +0000 (17:05 +0200)]
mds: remove waiting lock before merging with neighbours
CephFS currently deadlocks under CTDB's ping_pong POSIX locking test
when run concurrently on multiple nodes.
The deadlock is caused by failed removal of a waiting_locks entry when
the waiting lock is merged with an existing lock, e.g:
Note that the waiting 4116@1:1 lock entry is merged with the existing
4116@0:1 held lock to become a 4116@0:2 held lock. However, the now
handled 4116@1:1 waiting_locks entry remains.
When handling a lock request, the MDS calls adjust_locks() to merge
the new lock with available neighbours. If the new lock is merged,
then the waiting_locks entry is not located in the subsequent
remove_waiting() call because adjust_locks changed the new lock to
include the old locks.
This fix ensures that the waiting_locks entry is removed prior to
modification during merge.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Greg Farnum <greg@inktank.com>
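The ordering problem can be illustrated with a small Python model. This is a sketch under assumptions, not the MDS code: `Lock`, `merge_with_neighbours` and `grant` are hypothetical stand-ins for the lock state and for adjust_locks()/remove_waiting(), and a lock is modelled as owner, start and length, matching the 4116@1:1 notation above.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    owner: int
    start: int
    length: int

    @property
    def end(self):
        return self.start + self.length


def merge_with_neighbours(new, held):
    """Merge `new` in place with adjacent or overlapping held locks
    of the same owner (stand-in for adjust_locks())."""
    for other in list(held):
        if (other.owner == new.owner
                and other.start <= new.end and new.start <= other.end):
            held.remove(other)
            lo = min(new.start, other.start)
            hi = max(new.end, other.end)
            new.start, new.length = lo, hi - lo
    held.append(new)


def grant(new, held, waiting, remove_first):
    """Grant `new`; `waiting` is a set of (owner, start, length) keys."""
    if remove_first:
        # Fixed order: drop the waiting entry before the merge mutates `new`.
        waiting.discard((new.owner, new.start, new.length))
    merge_with_neighbours(new, held)
    if not remove_first:
        # Buggy order: the key now describes the merged lock, so the
        # original waiting entry is never found.
        waiting.discard((new.owner, new.start, new.length))
```

With a held 4116@0:1 lock and a waiting 4116@1:1 entry, the buggy order leaves the waiting entry behind while the fixed order removes it; in both cases the held lock becomes 4116@0:2.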
Sage Weil [Thu, 22 Aug 2013 22:54:48 +0000 (15:54 -0700)]
mon/Paxos: fix another uncommitted value corner case
It is possible that we begin the paxos recovery with an uncommitted
value for, say, commit 100. During last/collect we discover 100 has been
committed already. But also, another node provides an uncommitted value
for 101 with the same pn. Currently, we refuse to learn it, because the
pn is not strictly greater than our current uncommitted pn... even though it is
the next last_committed+1 value that we need.
There are two possible fixes here:
- make this a >= as we can accept newer values from the same pn.
- discard our uncommitted value metadata when we commit the value.
Let's do both!
Fixes: #6090
Signed-off-by: Sage Weil <sage@inktank.com>
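Both fixes can be modelled in a few lines of Python. This is a hedged sketch, not the mon/Paxos code: the class `PaxosRecovery`, its fields, and the `strict` and `clear_on_commit` switches are hypothetical and exist only to show the old and new behaviour side by side.

```python
class PaxosRecovery:
    """Toy model of the uncommitted-value bookkeeping during recovery."""

    def __init__(self, last_committed, uncommitted_v, uncommitted_pn,
                 strict=False, clear_on_commit=True):
        self.last_committed = last_committed
        self.uncommitted_v = uncommitted_v
        self.uncommitted_pn = uncommitted_pn
        self.strict = strict                    # True models the old '>' test
        self.clear_on_commit = clear_on_commit  # False models the old commit path

    def commit(self, v):
        self.last_committed = v
        if self.clear_on_commit and self.uncommitted_v == v:
            # Second fix: forget the uncommitted metadata once committed.
            self.uncommitted_v = None
            self.uncommitted_pn = 0

    def handle_uncommitted(self, v, pn):
        """Return True if a peer's uncommitted value is learned."""
        if v != self.last_committed + 1:
            return False
        if self.strict:
            newer = pn > self.uncommitted_pn
        else:
            # First fix: accept a newer value from the same pn.
            newer = pn >= self.uncommitted_pn
        if not newer:
            return False
        self.uncommitted_v, self.uncommitted_pn = v, pn
        return True
```

Starting with an uncommitted value for 100 at some pn, committing 100 and then receiving 101 with the same pn is refused only when both switches are set to the old behaviour; either fix alone is enough to learn 101.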
Yehuda Sadeh [Mon, 19 Aug 2013 23:56:27 +0000 (16:56 -0700)]
rgw: bucket meta remove don't overwrite entry point first
Fixes: #6056
When removing a bucket metadata entry we first unlink the bucket
and then we remove the bucket entrypoint object. Originally
when unlinking the bucket we first overwrote the bucket entrypoint
entry marking it as 'unlinked'. However, this is not really needed
as we're just about to remove it. The original version triggered
a bug, as we needed to propagate the new header version first (which
we didn't do, so the subsequent bucket removal failed).
Sandon Van Ness [Fri, 23 Aug 2013 02:44:40 +0000 (19:44 -0700)]
QA: Compile fsstress if missing on machine.
Some distros lack ltp-kernel packages and all we need is
fsstress. This just modifies the shell script to download/compile
fsstress from source and copy it to the right location if it doesn't
currently exist where it is expected. It is a very small/quick
compile and currently only SLES and debian do not have it already.
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Alexandre Oliva [Thu, 22 Aug 2013 06:40:22 +0000 (03:40 -0300)]
enable mds rejoin with active inodes' old parent xattrs
When the parent xattrs of active inodes that the mds attempts to open
during rejoin lack pool info (struct_v < 5), this field will be filled
in with -1, causing the mds to retry fetching a backtrace with a pool
number that matches the expected value, which fails and causes the
err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
with pool -1, and so keeps on bouncing between the two retry cases
forever.
This patch arranges for the mds to go along with pool -1 instead of
insisting that it be refetched, enabling it to complete recovery
instead of eating cpu, network bandwidth and metadata osd's resources
like there's no tomorrow, in what AFAICT is an infinite and very busy
loop.
This is not a new problem: I've had it even before upgrading from
Cuttlefish to Dumpling, I'd just never managed to track it down, and
force-unmounting the filesystem and then restarting the mds was an
easier (if inconvenient) work-around, particularly because it always
hit when the filesystem was under active, heavy-ish use (or there
wouldn't be much reason for caps recovery ;-)
There are two issues not addressed in this patch, however. One is
that nothing seems to proactively update the parent xattr when it is
found to be outdated, so it remains out of date forever. Not even
renaming top-level directories causes the xattrs to be recursively
rewritten. AFAICT that's a bug.
The other is that inodes that don't have a parent xattr (created by
even older versions of ceph) are reported as non-existing in the mds
rejoin message, because the absence of the parent xattr is signaled as
a missing inode ("failed to reconnect caps for missing inodes"). I
suppose this may cause more serious recovery problems.
I suppose a global pass over the filesystem tree updating parent
xattrs that are out-of-date would be desirable, if we find any parent
xattrs still lacking current information; it might make sense to
activate it as a background thread from the backtrace decoding
function, when it finds a parent xattr that's too out-of-date, or as a
separate client (ceph-fsck?).
Loic Dachary [Tue, 13 Aug 2013 15:28:31 +0000 (17:28 +0200)]
ReplicatedPG: create ObjectContext with SharedPtrRegistry
All new ObjectContext are replaced with calls to
SharedPtrRegistry::lookup_or_create to ensure that they are all
registered. Because the constructor is invoked with no argument, care
is taken to always initialize the destructor_callback data member
immediately afterwards.
ReplicatedPG::get_object_context contains a redundant call to
get_snapset_context that is removed.
Loic Dachary [Mon, 12 Aug 2013 14:47:42 +0000 (16:47 +0200)]
ReplicatedPG: ObjectContext is made compatible with SharedPtrRegistry
When creating a new object SharedPtrRegistry::lookup_or_create uses
the default ObjectContext constructor with no argument. The existing
ObjectContext constructor is modified to have no argument and the
initialization that was previously done within the constructor is done
by the caller (that only happens three times).
The ObjectContext::get method is removed: its only purpose is to
increment the ref.
The ObjectContext::registered data member is removed as well as all
the associated assert() calls.
The ObjectContext::destructor_callback data member Context is added
and called by the destructor. It will allow the caller to perform
additional cleanup, if necessary.
All ObjectContext * data members are replaced with shared_ptr.
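The pattern described here (default-constructed objects handed out by a registry, initialized by the caller, with a callback run on destruction) can be sketched in Python. This is only an analogy, not the Ceph implementation: `weakref.WeakValueDictionary` stands in for the C++ shared_ptr/weak_ptr bookkeeping, and the `Registry` class, the `oid` field, and the method bodies are assumptions.

```python
import weakref


class ObjectContext:
    """Constructed with no arguments; the caller fills in the fields."""

    def __init__(self):
        self.oid = None
        self.destructor_callback = None

    def __del__(self):
        # Give the caller a chance to perform additional cleanup.
        if self.destructor_callback is not None:
            self.destructor_callback()


class Registry:
    """Hands out shared objects and forgets them once unreferenced."""

    def __init__(self, factory):
        self._factory = factory
        self._live = weakref.WeakValueDictionary()

    def lookup_or_create(self, key):
        obj = self._live.get(key)
        if obj is None:
            obj = self._factory()  # default construction, no arguments
            self._live[key] = obj
        return obj

    def empty(self):
        return len(self._live) == 0
```

Because construction takes no arguments, the caller has to set `oid` and `destructor_callback` right after `lookup_or_create` returns a fresh object, which mirrors the care the commit messages describe.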
Loic Dachary [Thu, 15 Aug 2013 18:15:03 +0000 (20:15 +0200)]
ReplicatedPG: add Mutex to protect snapset_contexts
snapset_contexts_locks is added and locked in each function where
snapset_contexts or the SnapSetContext::ref data member needs to be
accessed or modified.
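A rough Python equivalent of guarding both the container and the reference count with one lock is shown below. It is a sketch only: `SnapSetContexts`, `get` and `put` are hypothetical names, and `threading.Lock` stands in for the Ceph Mutex.

```python
import threading


class SnapSetContext:
    def __init__(self, oid):
        self.oid = oid
        self.ref = 0


class SnapSetContexts:
    """Every access to the map or to `ref` happens under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}

    def get(self, oid):
        with self._lock:
            ssc = self._contexts.get(oid)
            if ssc is None:
                ssc = self._contexts[oid] = SnapSetContext(oid)
            ssc.ref += 1
            return ssc

    def put(self, ssc):
        with self._lock:
            ssc.ref -= 1
            if ssc.ref == 0:
                del self._contexts[ssc.oid]

    def __len__(self):
        with self._lock:
            return len(self._contexts)
```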
Loic Dachary [Mon, 12 Aug 2013 12:05:38 +0000 (14:05 +0200)]
sharedptr_registry: add a variant of get_next() and the empty() method
The SharedPtrRegistry::get_next() method with a value of type VPtr
instead of V is added because it is sometimes more convenient to not
copy the value when walking the registry. The
SharedPtrRegistry::empty() predicate method is added.
Josh Durgin [Wed, 21 Aug 2013 21:28:49 +0000 (14:28 -0700)]
objecter: resend unfinished lingers when osdmap is no longer paused
Plain Ops that haven't finished yet need to be resent if the osdmap
transitions from full or paused to unpaused. If these Ops are
triggered by LingerOps, they will be cancelled instead (since
should_resend = false), but the LingerOps that triggered them will not
be resent.
Fix this by checking the registered flag for all linger ops, and
resending any of them that aren't paused anymore.
Fixes: #6070
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
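The selection rule in the fix can be sketched as follows. This is an illustration under assumptions, not the Objecter code: `LingerOp` and its per-op `paused` field are hypothetical simplifications of how the objecter decides whether an op is still blocked by the osdmap flags.

```python
from dataclasses import dataclass


@dataclass
class LingerOp:
    linger_id: int
    registered: bool = False  # True once the linger registration finished
    paused: bool = False      # True while the osdmap still blocks this op


def lingers_to_resend(linger_ops):
    """Return the ids of unfinished lingers that are no longer paused."""
    return [op.linger_id
            for op in linger_ops
            if not op.registered and not op.paused]
```

Lingers that have already registered are left alone, and lingers that are still paused wait for a later map change.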
Yehuda Sadeh [Mon, 19 Aug 2013 15:40:16 +0000 (08:40 -0700)]
rgw: change cache / watch-notify init sequence
Fixes: #6046
We were initializing the watch-notify (through the cache
init) before reading the zone info which was much too
early, as we didn't have the control pool name yet. Now
simplifying init/cleanup a bit, cache doesn't call watch/notify
init and cleanup directly, but rather states its need
through a virtual callback.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
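The new init sequence might look roughly like this in Python. It is a sketch, not the rgw code: the class and method names (`Store`, `CachedStore`, `need_watch_notify`, `read_zone_info`, `init_watch`) and the pool name "control-pool" are placeholders chosen for the example.

```python
class Store:
    def __init__(self):
        self.events = []
        self.control_pool = None

    def need_watch_notify(self):
        """Virtual callback: subclasses state whether they need watch/notify."""
        return False

    def read_zone_info(self):
        # Placeholder value; the real name comes from the zone info.
        self.control_pool = "control-pool"
        self.events.append("zone_info")

    def init_watch(self):
        # Safe only after the zone info supplied the control pool name.
        assert self.control_pool is not None
        self.events.append("watch:" + self.control_pool)

    def initialize(self):
        self.read_zone_info()
        if self.need_watch_notify():
            self.init_watch()


class CachedStore(Store):
    def need_watch_notify(self):
        # The cache no longer calls watch/notify init itself.
        return True
```

The base class owns the ordering, so the cache cannot trigger watch/notify before the control pool name is known.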
vstart.sh: Adds more ENV variables to configure dev cluster
This patch adds a few ENV variables, so you can use vstart.sh
multiple times to launch multiple clusters:
CEPH_DIR -> The working directory of the cluster
CEPH_DEV_DIR -> the dev directory of the cluster
CEPH_OUT_DIR -> the output directory of the cluster
CEPH_RGW_PORT -> the default radosgw port to start with
All these new variables are set to default values if not specified,
so the previous behaviour of vstart.sh does not change.
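As a usage sketch, a small Python helper could build one environment per cluster before invoking vstart.sh. The helper `cluster_env` and the `dev`/`out` subdirectory layout are assumptions for the example; only the four variable names come from this commit.

```python
import os


def cluster_env(base_dir, rgw_port, parent_env=None):
    """Return an environment dict for one vstart.sh cluster."""
    env = dict(os.environ if parent_env is None else parent_env)
    env.update({
        "CEPH_DIR": base_dir,
        "CEPH_DEV_DIR": os.path.join(base_dir, "dev"),   # assumed layout
        "CEPH_OUT_DIR": os.path.join(base_dir, "out"),   # assumed layout
        "CEPH_RGW_PORT": str(rgw_port),
    })
    return env
```

Each dict can then be passed as `env=` to `subprocess.run` together with whatever vstart.sh arguments are normally used, giving every cluster its own directories and radosgw port.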
Sage Weil [Wed, 21 Aug 2013 05:39:09 +0000 (22:39 -0700)]
ceph-disk: partprobe after creating journal partition
At least one user reports that a partprobe is needed after creating the
journal partition. It is not clear why sgdisk is not doing it, but this
fixes ceph-disk for them, and should be harmless for other users.
Fixes: #5599
Tested-by: lurbs in #ceph
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 9 Aug 2013 19:40:34 +0000 (12:40 -0700)]
json_spirit: remove unused typedef
In file included from json_spirit/json_spirit_writer.cpp:7:0:
json_spirit/json_spirit_writer_template.h: In function 'String_type json_spirit::non_printable_to_string(unsigned int)':
json_spirit/json_spirit_writer_template.h:37:50: warning: typedef 'Char_type' locally defined but not used [-Wunused-local-typedefs]
typedef typename String_type::value_type Char_type;
Sage Weil [Tue, 20 Aug 2013 18:55:10 +0000 (11:55 -0700)]
common/crc32c: refactor a bit
- the generic function without the _le suffix (useless)
- use a static global so that detection only happens once
- make the structure a bit cleaner to plug in new implementations
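The "detect once, then dispatch" structure can be sketched in Python. This is not the common/crc32c code: the function names are hypothetical, the detection step simply returns the table-driven variant, and the checksum uses the standard CRC-32C convention (initial and final XOR with 0xFFFFFFFF), which may differ from how Ceph seeds its function.

```python
POLY = 0x82F63B78  # reflected Castagnoli polynomial


def crc32c_bitwise(data, crc=0):
    """Reference implementation, one bit at a time."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_table(data, crc=0):
    """Table-driven implementation, one byte at a time."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


_chosen = None       # module-level cache, like a static global
detect_calls = []    # records how often detection ran


def _detect():
    """Pick an implementation; real code would probe CPU features here."""
    detect_calls.append(1)
    return crc32c_table


def crc32c(data, crc=0):
    global _chosen
    if _chosen is None:
        _chosen = _detect()
    return _chosen(data, crc)
```

Adding another implementation means writing one more function with the same signature and teaching `_detect` when to return it.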
Sage Weil [Tue, 20 Aug 2013 18:27:23 +0000 (11:27 -0700)]
mon/Paxos: always refresh after any store_state
If we store any new state, we need to refresh the services, even if we
are still in the midst of Paxos recovery. This is because the
subscription path will share any committed state even when paxos is
still recovering. This prevents a race like:
- we have maps 10..20
- we drop out of quorum
- we are elected leader, paxos recovery starts
- we get one LAST with committed states that trim maps 10..15
- we get a subscribe for map 10..20
- we crash because 10 is no longer on disk because the PaxosService
is out of sync with the on-disk state.
Fixes: #6045
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
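The race listed above can be reproduced with a toy model. This is a sketch under assumptions, not the monitor code: `Store`, `Service`, `handle_subscribe` and the `refresh` switch on `store_state` are hypothetical names standing in for the on-disk state, a PaxosService, and the subscription path.

```python
class Store:
    """On-disk state: committed maps keyed by version."""

    def __init__(self, first, last):
        self.maps = {v: "map%d" % v for v in range(first, last + 1)}

    def trim_to(self, first):
        for v in [v for v in self.maps if v < first]:
            del self.maps[v]


class Service:
    """In-memory view that must be refreshed after the store changes."""

    def __init__(self, store):
        self.store = store
        self.refresh()

    def refresh(self):
        self.first_committed = min(self.store.maps)
        self.last_committed = max(self.store.maps)

    def handle_subscribe(self, start):
        start = max(start, self.first_committed)
        return [self.store.maps[v]
                for v in range(start, self.last_committed + 1)]


def store_state(store, service, trim_to, refresh=True):
    """Apply committed state that trims old maps, then refresh."""
    store.trim_to(trim_to)
    if refresh:
        service.refresh()
```

Without the refresh, a subscribe for maps 10..20 after maps 10..15 were trimmed looks up a map that is no longer in the store; with the refresh, the request is served starting from the first map still on disk.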
Sage Weil [Tue, 20 Aug 2013 18:23:46 +0000 (11:23 -0700)]
pybind: fix Rados.conf_parse_env test
This happens after we connect, which means we get ENOSYS always.
Instead, parse_env inside the normal setup method, which has the added
benefit of being able to debug these tests.
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Loic Dachary [Tue, 20 Aug 2013 14:17:10 +0000 (16:17 +0200)]
erasure code : plugin, interface and glossary documentation updates
* replace the erasure code plugin abstract interface with a doxygen link
that will be populated when the header shows in master
* update the plugin documentation to reflect the current draft implementation
* fix broken link to PGBackend-h
* add a glossary to define chunk, stripe, shard and strip with a drawing