Noah Watkins [Sun, 5 Oct 2014 20:15:13 +0000 (13:15 -0700)]
client: clean-up objecter on failed client init
During mount() the objecter isn't shutdown if the mon client fails to
initialize. Objecter asserts in destructor expect it to have been
shutdown but this skipped.
hadoop@plana85:~$ ./hadoop/bin/hadoop fs -ls /
14/10/05 12:35:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
osdc/Objecter.cc: In function 'virtual Objecter::~Objecter()' thread 7ff422705700 time 2014-10-05 12:35:51.090776
osdc/Objecter.cc: 3927: FAILED assert(!m_request_state_hook)
ceph version 0.85-981-g25bcc39 (25bcc39bb809e2d13beea1529e4ab92d1b61fa5b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x7ff3f5c28f7f]
2: (Objecter::~Objecter()+0x397) [0x7ff3f5bca707]
3: (Objecter::~Objecter()+0x9) [0x7ff3f5bca8b9]
4: (Client::~Client()+0x7d) [0x7ff3f5b6770d]
5: (Client::~Client()+0x9) [0x7ff3f5b680a9]
6: (ceph_mount_info::mount(std::string const&)+0x149) [0x7ff3f5b1fa49]
7: (ceph_mount()+0x4e) [0x7ff3f5b1dcbe]
8: (Java_com_ceph_fs_CephMount_native_1ceph_1mount()+0xb7) [0x7ff4158b1c97]
9: [0x7ff41839dd68]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted
Johnu George [Mon, 29 Sep 2014 17:07:44 +0000 (10:07 -0700)]
Crush: Ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per replica size configured
by the user.When there are more final osds (to be selected as per
rule) than the replicas, buffer overlaps and it causes crash.Now, it
ensures that at most num-rep osds are selected even if more number of
osds are allowed by indep rule. The fix for firstn rules is already
merged as part of bug #9492. Required test files are added.
Johnu George [Wed, 24 Sep 2014 16:32:50 +0000 (09:32 -0700)]
Crush: Ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per replica size configured
by the user.When there are more final osds (to be selected as per
rule) than the replicas, buffer overlaps and it causes crash.Now, it
ensures that at most num-rep osds are selected even if more number of
osds are allowed by the rule.
mon: Monitor: let 'handle_command()' deal with caps validation
If a given client doesn't have the required caps when running a command,
it must receive an EACCES or EPERM reply. This is already handled by
Monitor::handle_command(), which does an exceptionally good job at it.
Therefore, and unlike other messages that do not expect return values,
we can't simply drop the message if the client doesn't have the
appropriate capabilities, or things can get very weird very fast from
the user's perspective. Dropping the message for a command without a
reply has roughly the same effect as loss of quorum (timeout, pipes
failing) and confusion may ensue from it.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
qa/workunits: mon: auth_caps: account for mon blank caps
test creating and entity with blank caps with and without '--force'
being specified. without '--force' they must fail with EINVAL as the
monitor will not be able to parse them.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
mon: MonCommands: adjust indentation for 'auth add'
Eye-candy. We changed indentation of a few other entries and this one
was just too darn obvious, itching all over, night terrors, the whole
nine yards.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
qa/workunits: mon: auth_caps: variables must be local
We have variables with the same name that are being shared! We don't
hit any issues with it currently because the code just kind of works
even though that happens. Add a bit of new logic that relies on an
immutable return code (for instance) and we're in the woods.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
mon: Monitor: create logical divisions on dispatch() based on msg nature
Instead of a single switch(), have multiple switch() and order them by
increasing necessity of privileges.
This patch thus divides the big switch into:
- messages not requiring auth/caps checks at all
- messages which caps shall be checked somewhere else
- messages the Monitor class needs to deal with but only require a
client to have enough caps for the monitor to consider handling them
- messages that only a monitor is allowed to send.
Backport: firefly
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: Monitor: match command module caps against what's on MonCommands.h
We were checking the command's permissions against what we perceived as
the 'module' from parsing the command's "prefix" (specified by the
client). This caused troubles with cap checks for commands without a
submodule clearly defined, such as 'status' or 'health' (vs 'mon dump'
or 'osd pool set', which are of submodule 'mon' and 'osd' respectively).
As such, we now grab the command's submodule (right now solely for caps
checks) from the monitor's internal representation of the commands
(defined in mon/MonCommands.h and built at compile time and stashed in
'mon_commands'). Given that commands such as 'health', 'fsid' or
'status' have properly defined modules in MonCommands.h, we simply rely
on that representation for all commands. Which is what we should have
been doing from the start anyway, because we shouldn't be relying on the
client to point us to what we want to authenticate against.
Backport: firefly
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
mon: AuthMonitor: validate caps when creating or changing mon caps
The monitor doesn't really know how to validate caps not meant for it.
The MDS or the OSD may very well allow blank caps for instance, while
the monitor categorically does not. We can't simply state a capability
is invalid because we wouldn't take it as such.
On the other hand, we must check monitor caps and make sure they are
correct, otherwise malformed caps can go unnoticed for a while,
sometimes even being hard to understand what may have gone wrong.
Backport: firefly
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The STATEDIR variable is used to initialize the bootstrap-osd keyring
before it gets a chance to be overriden by --statedir. Replace it with
{statedir} so that it can be substituted after all options have been
parsed.
Sage Weil [Thu, 2 Oct 2014 22:22:56 +0000 (15:22 -0700)]
osdc/Objecter: use SafeTimer; make callbacks race-tolerant
The RWTimer event cancellation is racy. Instead, just make all of our
callbacks tolerate cancellation races. This is already true of most of
them (in fact, they are probably broken because they try to take a write
lock while holding a read lock). Fix C_CancelOp so that it calls the
other op_cancel (that takes a tid).
Then switch the RWTimer back to a SafeTimer. Put it in unsafe callbacks
mode because we don't want to introduce lock cycles with timer_lock.
Fixes: #9582
See also: #9650 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Wed, 1 Oct 2014 13:02:02 +0000 (06:02 -0700)]
ceph.spec.: add epoch
This is done in fedora packaging. Do it here too so that you can move
between upstream packages (from ceph.com) and fedora and other derivatives
will builds.
Backport: firefly, dumpling Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Wed, 1 Oct 2014 00:18:49 +0000 (17:18 -0700)]
mon: wait for paxos writes before touching state
Fix two other paths where may change the mon state so that we wait for the
pending write first. start_election() in particular can be triggered at
almost any time if we see an election message from another mon.
Locker: accept ctime updates from clients without dirty write caps
The ctime changes any time the inode does. That can happen even without
the file itself having changed, so we'd better accept the update whenever
the auth caps have dirtied, without worrying about the file caps!
Fixes: #9514
Backport: firefly
Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: John Spray <john.spray@redhat.com>
qa/workunits/cephtool/test.sh: fix thrash (ultimate)
Keep the osd trash test to ensure it is a valid command but make it a
noop by giving it a zero argument (meaning thrash 0 OSD maps).
Remove the loops that were added after the command in an attempt to wait
for the cluster to recover and not pollute the rest of the tests. Actual
testing of osd thrash would require a dedicated cluster because it the
side effects are random and it is unnecessarily difficult to ensure they
are finished.
Sage Weil [Thu, 25 Sep 2014 19:34:11 +0000 (12:34 -0700)]
osdc/Objecter: only post_rx_buffer if no op timeout
If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message. In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.
Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.
Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.
Fixes: #9582
Backport: firefly, dumpling Signed-off-by: Sage Weil <sage@redhat.com>
erasure-code: test isa encode/decode with various object sizes
Create an encode_decode() helper method to be called from the
encode_decode test function with various object size arguments. The
helper method is a copy/paste of the previous test that was using a
single object of a fixed size. The test is slightly adapted to
accommodate for different object sizes but the logic is not modified.
The object sizes being tested are chosen to be under the size of the
required size alignment or on multiple pages, size aligned or not.
John Spray [Thu, 25 Sep 2014 16:01:10 +0000 (17:01 +0100)]
msg: allow calling dtor immediately after ctor
Asserting on reaper_stop only made sense if the
messenger had ever been started: as it stood,
one couldn't create and destroy a messenger
without also starting and stopping it.
erasure-code: isa encode tests adapted to per chunk alignment
The encode tests use the alignment constraints. It has been changed to
be aligned on a per chunk basis instead of computing a more expensive
object alignement constraint. The test function is modified to take the
change into account but the logic is otherwise unmodified.
erasure-code: isa uses per chunk alignment constraints
Copy code from the jerasure plugin to enforce alignment constraints per
chunk instead of using the total object size. It is simpler and reduces
the size of the chunks. See
https://github.com/ceph/ceph/commit/c7daaaf5e63d0bd1d444385e62611fe276f6ce29
for more information.
Andreas Peters [Thu, 25 Sep 2014 14:48:47 +0000 (16:48 +0200)]
erasure-code: [ISA] modify get_alignment function to imply a platform/compiler independent alignment constraint of 32-byte aligned buffer addresses & length
Sage Weil [Tue, 23 Sep 2014 23:21:33 +0000 (16:21 -0700)]
osd: initialize purged_snap on backfill start; restart backfill if change
If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps. As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held. This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.
Resolve this by initializing purged_snaps when we finish backfill. The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps. Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).
If we by chance to interrupt backfill, go clean with other OSDs,
purge snaps, and then let this OSD rejoin, we will either restart
backfill (non-contiguous log) or the log will include the result of
the snap trim (the events that remove the trimmed snap).
Fixes: #9487
Backfill: firefly, dumpling Signed-off-by: Sage Weil <sage@redhat.com>
Otherwise statfs may fail if mkfs hasn't been run yet or if the monitor
data directory does not exist. There are checks to account for the mon
data dir not existing and we should wait for them to clear before we go
ahead and check the fs stats.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
There are two new plugins (isa and lrc). When upgrading a cluster, there
must be a protection against the following scenario:
* the mon are upgraded but not the osd
* a new pool is created using plugin isa
* the osd fail to load the isa plugin because they have not been
upgraded
A feature bit is added : PLUGINS_V2. The monitor will only agree to
create an erasure code profile for the isa or lrc plugin if all OSDs
supports PLUGINS_V2. Once such an erasure code profile is stored in the
OSDMap, an OSD can only boot if it supports the PLUGINS_V2 feature,
which means it is able to load the isa and lrc plugins.
The monitors will only activate the PLUGINS_V2 feature if all monitors
in the quorum support it. It protects against the following scenario:
* the leader is upgraded the peons are not upgraded
* the leader creates a pool with plugin=lrc because all OSD have
the PLUGINS_V2 feature
* the leader goes down and a non upgraded peon becomes the leader
* an old OSD tries to join the cluster
* the new leader will let the OSD boot because it does not contain
the logic that would excluded it
* the old OSD will fail when required to load the plugin lrc
This is going to be needed each time new plugins are added, which is
impractical. A more generic plugin upgrade support should be added
instead, as described in http://tracker.ceph.com/issues/7291.
erasure-code: restore jerasure BlaumRoth default w
Changing from W=7 to W=6 by default for the BlaumRoth technique is
correct but introduces a regression. The content that was encoded with
the previous version cannot be read again. Although the prime(w+1)
constraint was not obeyed by W=7, the encoded content was useable and
should keep being readable.
The W=7 remains the default for backward compatibility and an exception
to the prime(w+1) check.