Sage Weil [Wed, 10 Jul 2013 18:02:08 +0000 (11:02 -0700)]
osd: limit number of inc osdmaps send to peers, clients
We should not send an unbounded number of inc maps to our peers or clients.
In particular, if a peer is not contacted for a while, we may think they
have a very old map (say, 10000 epochs ago) and send thousands of inc maps
when the distribution shifts and we need to peer.
Note that if we do not send enough maps, the peers will make do by
requesting the map from somewhere else (currently the mon). Regardless
of the source, however, we must limit the amount that we speculatively
share as it usually is not needed.
Backport: cuttlefish, bobtail Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 653e04a79430317e275dd77a46c2b17c788b860b)
Adjust the return value from rgw_get_user_info_by_swift call
in RGW_SWIFT_Auth_Get::execute() to have the correct
return code in response.
(cherry picked from commit 4089001de1f22d6acd0b9f09996b71c716235551)
Sage Weil [Tue, 9 Jul 2013 00:46:40 +0000 (17:46 -0700)]
mon/OSDMonitor: fix base case for loading full osdmap
Right after cluster creation, first_committed is 1 and latest stashed in 0,
but we don't have the initial full map yet. Thereafter, we do (because we
write it with trim). Fixes afd6c7d8247075003e5be439ad59976c3d123218.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 43fa7aabf1f7e5deb844c1f52d451bab9e7d1006)
Sage Weil [Mon, 8 Jul 2013 22:04:59 +0000 (15:04 -0700)]
mon: fix osdmap stash, trim to retain complete history of full maps
The current interaction between sync and stashing full osdmaps only on
active mons means that a sync can result in an incomplete osdmap_full
history:
- mon.c starts a full sync
- during sync, active osdmap service should_stash_full() is true and
includes a full in the txn
- mon.c sync finishes
- mon.c update_from_paxos gets "latest" stashed that it got from the
paxos txn
- mon.c does *not* walk to previous inc maps to complete it's collection
of full maps.
To fix this, we disable the periodic/random stash of full maps by the
osdmap service.
This introduces a new problem: we must have at least one full map (the first
one) in order for a mon that just synced to build it's full collection.
Extend the encode_trim() process to allow the osdmap service to include
the oldest full map with the trim txn. This is more complex than just
writing the full maps in the txn, but cheaper--we only write the full
map at trim time.
This *might* be related to previous bugs where the full osdmap was
missing, or case where leveldb keys seemed to 'disappear'.
Fixes: #5512
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit afd6c7d8247075003e5be439ad59976c3d123218)
Samuel Just [Wed, 3 Jul 2013 18:18:33 +0000 (11:18 -0700)]
Elector.h: features are 64 bit
Fixes: #5497 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 3564e304e3f50642e4d9ff25e529d5fc60629093)
Samuel Just [Wed, 3 Jul 2013 18:18:19 +0000 (11:18 -0700)]
ceph_features.h: declare all features as ULL
Otherwise, the first 32 get |'d together as ints. Then, the result
((int)-1) is sign extended to ((long long int)-1) before being |'d
with the 1LL entries. This results in ~((uint64_t)0).
Fixes: #5497 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 4255b5c2fb54ae40c53284b3ab700fdfc7e61748)
Samuel Just [Wed, 3 Jul 2013 04:09:36 +0000 (21:09 -0700)]
Pipe: use uint64_t not unsigned when setting features
Fixes: #5497 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit bc3e2f09f8860555d8b3b49b2eea164b4118d817)
Sage Weil [Mon, 8 Jul 2013 18:24:48 +0000 (11:24 -0700)]
client: remove O_LAZY
The once-upon-a-time unique O_LAZY value I chose forever ago is now
O_NOATIME, which means that some clients are choosing relaxed
consistency without meaning to.
It is highly unlikely that a real O_LAZY will ever exist, and we can
select it in the ceph case with the ioctl or libcephfs call, so drop
any support for doing this via open(2) flags.
Update doc/lazy_posix.txt file re: lazy io.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 94afedf02d07ad4678222aa66289a74b87768810)
Sage Weil [Fri, 5 Jul 2013 23:03:49 +0000 (16:03 -0700)]
mon: remove bad assert about monmap version
It is possible to start a sync when our newest monmap is 0. Usually we see
e0 from probe, but that isn't always published as part of the very first
paxos transaction due to the way PaxosService::_active generates it's
first initial commit.
In any case, having e0 here is harmless.
Fixes: #5509 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 85a1d6cc5d3852c94d1287b566656c5b5024fa13)
Sage Weil [Wed, 3 Jul 2013 23:56:06 +0000 (16:56 -0700)]
mon/Paxos: make 'paxos trim disabled max versions' much much larger
108000 is about 3 hours if paxos is going full-bore (1 proposal/second).
That ought to be pretty safe. Otherwise, we start trimming to soon and a
slow sync will just have to restart when it finishes.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 71ebfe7e1abe4795b46cf00dfe1b03d1893368b0)
Sage Weil [Wed, 26 Jun 2013 13:01:40 +0000 (06:01 -0700)]
mon: do not reopen MonitorDBStore during startup
level doesn't seem to like this when it races with an internal compaction
attempt (see below). Instead, let the store get opened by the ceph_mon
caller, and pull a bit of the logic into the caller to make the flow a
little easier to follow.
Sage Weil [Tue, 2 Jul 2013 21:43:17 +0000 (14:43 -0700)]
sysvinit, upstart: handle symlinks to dirs in /var/lib/ceph/*
Match a symlink to a dir, not just dirs. This fixes the osd case of e.g.,
creating an osd in /data/osd$id in which ceph-disk makes a symlink from
/var/lib/ceph/osd/ceph-$id.
Fix proposed by Matt Thompson <matt.thompson@mandiant.com>; extended to
include the upstart users too.
Fixes: #5490 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 87c98e92d1375c8bc76196bbbf06f677bef95e64)
Sage Weil [Tue, 2 Jul 2013 00:33:11 +0000 (17:33 -0700)]
rgw: add RGWFormatter_Plain allocation to sidestep cranky strlen()
Valgrind complains about an invalid read when we don't pad the allocation,
and because it is inlined we can't whitelist it for valgrind. Workaround
the warning by just padding our allocations a bit.
Yan, Zheng [Wed, 15 May 2013 03:24:36 +0000 (11:24 +0800)]
mds: warn on unconnected snap realms
When there are more than one active MDS, restarting MDS triggers
assertion "reconnected_snaprealms.empty()" quite often. If there
is no snapshot in the FS, the items left in reconnected_snaprealms
should be other MDS' mdsdir. I think it's harmless.
If there are snapshots in the FS, the assertion probably can catch
real bugs. But at present, snapshot feature is broken, fixing it is
non-trivial. So replace the assertion with a warning.
Sage Weil [Wed, 26 Jun 2013 13:53:08 +0000 (06:53 -0700)]
mon/PGMonitor: use post_paxos_update, not init, to refresh from osdmap
We do two things here:
- make init an one-time unconditional init method, which is what the
health service expects/needs.
- switch PGMonitor::init to be post_paxos_update() which is called after
the other services update, which is what PGMonitor really needs.
This is a new version of the fix originally in commit a2fe0137946541e7b3b537698e1865fbce974ca6 (and those around it). That is,
this re-fixes a problem where osds do not see pg creates from their
subscribe due to map_pg_creates() not getting called.
Sage Weil [Thu, 20 Jun 2013 00:27:49 +0000 (17:27 -0700)]
ceph-disk: use unix lock instead of lockfile class
The lockfile class relies on file system trickery to get safe mutual
exclusion. However, the unix syscalls do this for us. More
importantly, the unix locks go away when the owning process dies, which
is behavior that we want here.
Fixes: #5387
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 2a4953b697a3464862fd3913336edfd7eede2487)
ceph-disk: make list_partition behave with unusual device names
When you get device names like sdaa you do not want to mistakenly conclude that
sdaa is a partition of sda. Use /sys/block/$device/$partition existence
instead.
Fixes: #5211
Backport: cuttlefish Signed-off-by: Alexandre Maragone <alexandre.maragone@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8c0daafe003935881c5192e0b6b59b949269e5ae)
Sage Weil [Tue, 18 Jun 2013 03:28:24 +0000 (20:28 -0700)]
client: fix warning
client/Client.cc: In member function 'virtual void Client::ms_handle_remote_reset(Connection*)':
warning: client/Client.cc:7892:9: enumeration value 'STATE_NEW' not handled in switch [-Wswitch]
warning: client/Client.cc:7892:9: enumeration value 'STATE_OPEN' not handled in switch [-Wswitch]
warning: client/Client.cc:7892:9: enumeration value 'STATE_CLOSED' not handled in switch [-Wswitch]
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 8bd936f077530dfeb2e699164e4492b1c0973088)
Sage Weil [Tue, 25 Jun 2013 00:58:48 +0000 (17:58 -0700)]
mon/AuthMonitor: ensure initial rotating keys get encoded when create_initial called 2x
The create_initial() method may get called multiple times; make sure it
will unconditionally generate new/initial rotating keys. Move the block
up so that we can easily assert as much.
Sage Weil [Mon, 24 Jun 2013 19:52:44 +0000 (12:52 -0700)]
common/pick_addresses: behave even after internal_safe_to_start_threads
ceph-mon recently started using Preforker to working around forking issues.
As a result, internal_safe_to_start_threads got set sooner and calls to
pick_addresses() which try to set string config values now fail because
there are no config observers for them.
Work around this by observing the change while we adjust the value. We
assume pick_addresses() callers are smart enough to realize that their
result will be reflected by cct->_conf and not magically handled elsewhere.
Fixes: #5195, #5205
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit eb86eebe1ba42f04b46f7c3e3419b83eb6fe7f9a)
Sage Weil [Thu, 20 Jun 2013 22:39:23 +0000 (15:39 -0700)]
mon/PaxosService: allow paxos service writes while paxos is updating
In commit f985de28f86675e974ac7842a49922a35fe24c6c I mistakenly made
is_writeable() false while paxos was updating due to a misread of
Paxos::propose_new_value() (I didn't see that it would queue).
This is problematic because it narrows the window during which each service
is writeable for no reason.
Allow service to be writeable both when paxos is active and updating.
Sage Weil [Fri, 7 Jun 2013 18:14:58 +0000 (11:14 -0700)]
mon/Paxos: not readable when LOCKED
If we are re-proposing a previously accepted value from a previous quorum,
we should not consider it readable, because it is possible it was exposed
to clients as committed (2/3 accepted) but not recored to be committed, and
we do not want to expose old state as readable when new state was
previously readable.
Sage Weil [Fri, 31 May 2013 23:39:37 +0000 (16:39 -0700)]
mon/Paxos: go active *after* refreshing
The update_from_paxos() methods occasionally like to trigger new activity.
As long as they check is_readable() and is_writeable(), they will defer
until we go active and that activity will happen in the normal callbacks.
This fixes the problem where we active but is_writeable() is still false,
triggered by PGMonitor::check_osd_map().
Sage Weil [Sun, 2 Jun 2013 23:57:11 +0000 (16:57 -0700)]
mon/Paxos: do paxos refresh in finish_proposal; and refactor
Do the paxos refresh inside finish_proposal, ordered *after* the leader
assertion so that MonmapMonitor::update_from_paxos() calling bootstrap()
does not kill us.
Also, remove unnecessary finish_queued_proposal() and move the logic inline
where the bad leader assertion is obvious.
Sage Weil [Sun, 2 Jun 2013 23:14:01 +0000 (16:14 -0700)]
mon: explicitly refresh_from_paxos() when leveldb state changes
Instead of opportunistically calling each service's update_from_paxos(),
instead explicitly refresh all in-memory state whenever we know the
paxos state may have changed. This is simpler and less fragile.
Sage Weil [Thu, 9 May 2013 16:44:20 +0000 (09:44 -0700)]
osd: init test_ops_hook
CID 1019628 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member "test_ops_hook" is not initialized in this constructor nor in any functions that it calls.
Sage Weil [Thu, 9 May 2013 16:45:51 +0000 (09:45 -0700)]
osd: initialize OSDService::next_notif_id
CID 1019627 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member "next_notif_id" is not initialized in this constructor nor in any functions that it calls.
Sage Weil [Wed, 22 May 2013 21:29:37 +0000 (14:29 -0700)]
messages/MOSDMarkMeDown: fix uninit field
Fixes valgrind warning:
==14803== Use of uninitialised value of size 8
==14803== at 0x12E7614: sctp_crc32c_sb8_64_bit (sctp_crc32.c:567)
==14803== by 0x12E76F8: update_crc32 (sctp_crc32.c:609)
==14803== by 0x12E7720: ceph_crc32c_le (sctp_crc32.c:733)
==14803== by 0x105085F: ceph::buffer::list::crc32c(unsigned int) (buffer.h:427)
==14803== by 0x115D7B2: Message::calc_front_crc() (Message.h:441)
==14803== by 0x1159BB0: Message::encode(unsigned long, bool) (Message.cc:170)
==14803== by 0x1323934: Pipe::writer() (Pipe.cc:1524)
==14803== by 0x13293D9: Pipe::Writer::entry() (Pipe.h:59)
==14803== by 0x120A398: Thread::_entry_func(void*) (Thread.cc:41)
==14803== by 0x503BE99: start_thread (pthread_create.c:308)
==14803== by 0x6C6E4BC: clone (clone.S:112)
Sage Weil [Tue, 18 Jun 2013 03:32:15 +0000 (20:32 -0700)]
common/Preforker: fix warning
common/Preforker.h: In member function ‘int Preforker::signal_exit(int)’:
warning: common/Preforker.h:82:45: ignoring return value of ‘ssize_t safe_write(int, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
This is harder than it should be to fix. :(
http://stackoverflow.com/questions/3614691/casting-to-void-doesnt-remove-warn-unused-result-error
Whatever, I guess we can do something useful with this return value.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit ce7b5ea7d5c30be32e4448ab0e7e6bb6147af548)
mon: Monitor: make sure we backup a monmap during sync start
First of all, we must find a monmap to backup. The newest version.
Secondly, we must make sure we back it up before clearing the store.
Finally, we must make sure that we don't remove said backup while
clearing the store; otherwise, we would be out of a backup monmap if the
sync happened to fail (and if the monitor happened to be killed before a
new sync had finished).
This patch makes sure these conditions are met.
Fixes: #5256 (partially)
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5e6dc4ea21b452e34599678792cd36ce1ba3edb3)
mon: Monitor: obtain latest monmap on sync store init
Always use the highest version amongst all the typically available
monmaps: whatever we have in memory, whatever we have under the
MonmapMonitor's store, and whatever we have backed up from a previous
sync. This ensures we always use the newest version we came across
with.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6284fdce794b73adcc757fee910e975b6b4bd054)
mon: Monitor: don't remove 'mon_sync' when clearing the store during abort
Otherwise, we will end up losing the monmap we backed up when we started
the sync, and the monitor may be unable to start if it is killed or
crashes in-between the sync abort and finishing a new sync.
Fixes: #5256 (partially)
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit af5a9861d7c6b4527b0d2312d0efa792910bafd9)
Sage Weil [Wed, 19 Jun 2013 04:31:23 +0000 (21:31 -0700)]
os/FileStore: drop posix_fadvise(...DONTNEED)
On XFS this call is problematic because it directly calls the filemap
writeback without vectoring through xfs. This can break the delicate
ordering of writeback and range zeroing; see #4976 and this thread
Sage Weil [Sun, 16 Jun 2013 03:06:33 +0000 (20:06 -0700)]
ceph-disk: do not stop activate-all on first failure
Keep going even if we hit one activation error. This avoids failing to
start some disks when only one of them won't start (e.g., because it
doesn't belong to the current cluster).
Sage Weil [Fri, 14 Jun 2013 22:01:14 +0000 (15:01 -0700)]
ceph.spec: install/uninstall init script
This was commented out almost years ago in commit 9baf5ef4 but it is not
clear to me that it was correct to do so. In any case, we are not
installing the rc.d links for ceph, which means it does not start up after
a reboot.
Sage Weil [Fri, 14 Jun 2013 20:34:40 +0000 (13:34 -0700)]
ceph-disk: add 'activate-all'
Scan /dev/disk/by-parttypeuuid for ceph OSDs and activate them all. This
is useful when the event didn't trigger on the initial udev event for
some reason.
Yehuda Sadeh [Fri, 14 Jun 2013 21:53:54 +0000 (14:53 -0700)]
rgw: escape prefix correctly when listing objects
Fixes: #5362
When listing objects prefix needs to be escaped correctly (the
same as with the marker). Otherwise listing objects with prefix
that starts with underscore doesn't work.
Backport: bobtail, cuttlefish
Sage Weil [Sat, 15 Jun 2013 15:14:40 +0000 (08:14 -0700)]
common/Preforker: fix broken recursion on exit(3)
If we exit via preforker, call exit(3) and not recursively back into
Preforker::exit(r). Otherwise you get a hang with the child blocked
at:
Thread 1 (Thread 0x7fa08962e7c0 (LWP 5419)):
#0 0x000000309860e0cd in write () from /lib64/libpthread.so.0
#1 0x00000000005cc906 in Preforker::exit(int) ()
#2 0x00000000005c8dfb in main ()
and the parent at
#0 0x000000309860eba7 in waitpid () from /lib64/libpthread.so.0
#1 0x00000000005cc87a in Preforker::parent_wait() ()
#2 0x00000000005c75ae in main ()
Sage Weil [Fri, 14 Jun 2013 04:56:23 +0000 (21:56 -0700)]
ceph-disk: do not use mount --move (or --bind)
The kernel does not let you mount --move when the parent mount is
shared (see, e.g., https://bugzilla.redhat.com/show_bug.cgi?id=917008
for another person this also confused). We can't use --bind either
since that (on RHEL at least) screws up /etc/mtab so that the final
result looks like
Sage Weil [Thu, 13 Jun 2013 22:54:58 +0000 (15:54 -0700)]
ceph-disk: implement 'activate-journal'
Activate an osd via its journal device. udev populates its symlinks and
triggers events in an order that is not related to whether the device is
an osd data partition or a journal. That means that triggering
'ceph-disk activate' can happen before the journal (or journal symlink)
is present and then fail.
Similarly, it may be that they are on different disks that are hotplugged
with the journal second.
This can be wired up to the journal partition type to ensure that osds are
started when the journal appears second.
Sage Weil [Wed, 12 Jun 2013 01:35:01 +0000 (18:35 -0700)]
ceph-disk: call partprobe outside of the prepare lock; drop udevadm settle
After we change the final partition type, sgdisk may or may not trigger a
udev event, depending on how well udev is behaving (it varies between
distros, it seems). The old code would often settle and wait for udev to
activate the device, and then partprobe would uselessly fail because it
was already mounted.
Call partprobe only at the very end, after prepare is done. This ensures
that if partprobe calls udevadm settle (which is sometimes does) we do not
get stuck.
Drop the udevadm settle. I'm not sure what this accomplishes; take it out,
at least until we determine we need it.