David Zafman [Tue, 9 Jul 2013 01:58:12 +0000 (18:58 -0700)]
osd: Clean-up redundant use of object_locator_t
Remove locator arg from get_object_context()/find_object_context()
Remove locator from object_info_t but retain encode format
Remove locator from object_info_t dump output
Remove OLOC_BLANK
Signed-off-by: David Zafman <david.zafman@inktank.com>
David Zafman [Tue, 11 Jun 2013 01:18:59 +0000 (18:18 -0700)]
librados, os, osd, osdc, test: Add support for client specified namespaces
Add rados_ioctx_namespace_set_key() and librados::IoCtx::namespace_set_key()
Add namespace to admin-daemon operations
Support namespace in osd map command
Add namespace to object_locator_t and hobject_t
Add random namespaces to psim program
Feature: #4982 (OSD: namespaces pt 1 (librados/osd, not caps))
Signed-off-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Wed, 3 Jul 2013 18:18:33 +0000 (11:18 -0700)]
Elector.h: features are 64 bit
Fixes: #5497
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
Samuel Just [Wed, 3 Jul 2013 18:18:19 +0000 (11:18 -0700)]
ceph_features.h: declare all features as ULL
Otherwise, the first 32 get |'d together as ints. Then, the result
((int)-1) is sign extended to ((long long int)-1) before being |'d
with the 1LL entries. This results in ~((uint64_t)0).
Fixes: #5497
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
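The sign extension described in this commit can be reproduced in a few lines. This is a minimal sketch, not the Ceph source: the function names are hypothetical, and `int low = -1` stands in for the result of OR-ing thirty-two int-typed feature bits together.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the old declarations: the low feature bits are
// plain ints, so OR-ing all 32 of them together yields (int)-1.
std::uint64_t combine_with_int_low_bits() {
    int low = -1;                // all 32 low bits set, but typed int
    long long high = 1LL << 32;  // a feature bit declared with 1LL
    // 'low' is converted to long long before the OR, which sign-extends it
    // to (long long)-1, so every bit of the result ends up set.
    return static_cast<std::uint64_t>(low | high);
}

// Hypothetical stand-in for the fix: every bit is declared unsigned 64-bit,
// so no sign extension can occur.
std::uint64_t combine_with_ull_bits() {
    std::uint64_t low = 0xffffffffULL;
    std::uint64_t high = 1ULL << 32;
    std::uint64_t all = low | high;
    assert((all >> 33) == 0);    // nothing above bit 32 leaked in
    return all;
}
```

With int-typed low bits the combined mask claims every feature; with ULL declarations only the 33 intended bits are set.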
Samuel Just [Wed, 3 Jul 2013 04:09:36 +0000 (21:09 -0700)]
Pipe: use uint64_t not unsigned when setting features
Fixes: #5497
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
Sage Weil [Wed, 3 Jul 2013 19:20:45 +0000 (12:20 -0700)]
common: autoselect crc32c based on cpu features
If the CPU supports SSE4.2, use the crc32c instructions. Use the magic
incantation from who knows where to do this. __builtin_cpu_supports()
is a nicer way to do it, but that is new in gcc 4.8.
Avoid static globals; they are bad. Sadly that means we redetect the CPU
feature on every call. I assume that is reasonably efficient...
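A rough sketch of the selection logic, assuming GCC or Clang on x86 (where `<cpuid.h>` provides `__get_cpuid`). It is not the code from this commit: `cpu_has_sse42`, `choose_crc32c`, `crc32c_sw` and `crc32c` are hypothetical names, the hardware routine is left as a pointer the caller supplies, and the software path is a plain bitwise CRC-32C using the common init/final-xor convention. As in the commit, nothing is cached in a static global, so the CPU is re-probed on each call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

typedef std::uint32_t (*crc32c_fn)(std::uint32_t, const unsigned char*, std::size_t);

// Probe CPUID leaf 1; SSE4.2 is reported in ECX bit 20.
bool cpu_has_sse42() {
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 20)) != 0;
#else
    return false;  // assumption: no probe available, use the software path
#endif
}

// Portable bitwise CRC-32C (reflected Castagnoli polynomial 0x82F63B78).
std::uint32_t crc32c_sw(std::uint32_t crc, const unsigned char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return crc;
}

// Pick the hardware routine only if one was supplied and the CPU supports it.
crc32c_fn choose_crc32c(crc32c_fn hw, crc32c_fn sw) {
    return (hw != nullptr && cpu_has_sse42()) ? hw : sw;
}

// No hardware routine is supplied in this sketch, so this always runs crc32c_sw.
std::uint32_t crc32c(const unsigned char* data, std::size_t len) {
    crc32c_fn fn = choose_crc32c(nullptr, crc32c_sw);
    assert(fn != nullptr);
    return ~fn(~0u, data, len);
}
```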
Sage Weil [Tue, 2 Jul 2013 21:43:17 +0000 (14:43 -0700)]
sysvinit, upstart: handle symlinks to dirs in /var/lib/ceph/*
Match a symlink to a dir, not just dirs. This fixes the osd case of e.g.,
creating an osd in /data/osd$id in which ceph-disk makes a symlink from
/var/lib/ceph/osd/ceph-$id.
Fix proposed by Matt Thompson <matt.thompson@mandiant.com>; extended to
include the upstart users too.
Fixes: #5490
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Sage Weil [Tue, 18 Jun 2013 03:54:15 +0000 (20:54 -0700)]
ceph-disk: make is_held() smarter about full disks
Handle the case where the device is a full disk. Make the partition
check a bit more robust (don't make assumptions about naming aside from
the device being a prefix of the partition).
Sage Weil [Tue, 2 Jul 2013 20:43:29 +0000 (13:43 -0700)]
osdc/Objecter: resend command map version checks on reconnect
We already do this for Ops and LingerOps, but missed this when we added
CommandOps to the mix. The result is that an ill-timed mon disconnect will
leave a command map check (and thus the command) hanging.
set object_info_t pool of an ObjectContext if it is undefined or bad
When reading object_info_t from an existing object attribute, the pool
may be < 0 and should be set to the pool containing the object. This is done
on the oi object on the stack but overridden later by:
obc->obs.oi.decode(bv);
This decode is superfluous and is removed so that it does not override
the modified value of the pool.
Sage Weil [Tue, 2 Jul 2013 00:33:11 +0000 (17:33 -0700)]
rgw: add RGWFormatter_Plain allocation to sidestep cranky strlen()
Valgrind complains about an invalid read when we don't pad the allocation,
and because it is inlined we can't whitelist it for valgrind. Work around
the warning by just padding our allocations a bit.
Fixes: #5346
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 29 Jun 2013 01:15:23 +0000 (18:15 -0700)]
osd: set maximum object attr size
Make a well-defined maximum size of an object attribute. Since Linux has
a 64KB limit, and that is what we normally use to back this, use that as
the limit. This means that even when leveldb is backing large xattrs
(as ext4 users must do) we will return EFBIG on >64KB setxattr attempts.
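A minimal sketch of the size check, assuming a hypothetical `set_attr` helper over an in-memory map. The constant name `kMaxAttrSize` and both function names are illustrative and not taken from the OSD code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical constant mirroring the 64KB limit described above.
const std::size_t kMaxAttrSize = 64 * 1024;

// Reject oversized values up front, whatever backend stores the attribute.
int set_attr(std::map<std::string, std::string>& attrs,
             const std::string& name, const std::string& value) {
    if (value.size() > kMaxAttrSize)
        return -EFBIG;
    attrs[name] = value;
    return 0;
}

// Helper for trying a value of a given length against an empty attr map.
int try_set_attr_of_size(std::size_t len) {
    std::map<std::string, std::string> attrs;
    int r = set_attr(attrs, "user.demo", std::string(len, 'x'));
    assert(r != 0 || attrs.count("user.demo") == 1);  // stored iff accepted
    return r;
}
```

A value of exactly 64KB is accepted; one byte more returns -EFBIG.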
Sage Weil [Fri, 28 Jun 2013 18:50:11 +0000 (11:50 -0700)]
client: fix remaining Inode::put() caller, and make method pseudo-private
Not sure I can make this actually private and make Client::put_inode() a
friend method (making all of Client a friend would defeat the purpose).
This works well enough, though!
Sage Weil [Fri, 28 Jun 2013 04:39:35 +0000 (21:39 -0700)]
client: use put_inode on MetaRequest inode refs
When we drop the request inode refs, we need to use put_inode() to ensure
they get cleaned up properly (removed from inode_map, caps released, etc.).
Do this explicitly here (as we do with all other inode put() paths that
matter).
Fixes: #5381
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 27 Jun 2013 00:34:39 +0000 (17:34 -0700)]
mon/PGMonitor: avoid duplicating map_pg_create() effort on same maps
If we have an election and refresh, but the osdmap does not change, there
is no need to recalculate the pg create maps. However, if we register new
creating pgs, we do... when the last_pg_scan update gets pulled out of
paxos (i.e., on both leader and peon mons).
Dan Mick [Wed, 26 Jun 2013 01:23:22 +0000 (18:23 -0700)]
Makefile.am: fix libglobal.la race with ceph_test_cors
ceph_test_cors had libglobal.la in its _LDFLAGS macro definition;
it should have been in _LDADD. Moreover, things using libglobal.la
ought to be using LIBGLOBAL_LDA to add it to _LDADD. Fix them all.
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Wed, 26 Jun 2013 13:53:08 +0000 (06:53 -0700)]
mon/PGMonitor: use post_paxos_update, not init, to refresh from osdmap
We do two things here:
- make init a one-time unconditional init method, which is what the
health service expects/needs.
- switch PGMonitor::init to be post_paxos_update() which is called after
the other services update, which is what PGMonitor really needs.
This is a new version of the fix originally in commit a2fe0137946541e7b3b537698e1865fbce974ca6 (and those around it). That is,
this re-fixes a problem where osds do not see pg creates from their
subscribe due to map_pg_creates() not getting called.
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 26 Jun 2013 13:01:40 +0000 (06:01 -0700)]
mon: do not reopen MonitorDBStore during startup
leveldb doesn't seem to like this when it races with an internal compaction
attempt (see below). Instead, let the store get opened by the ceph_mon
caller, and pull a bit of the logic into the caller to make the flow a
little easier to follow.
Sage Weil [Tue, 25 Jun 2013 22:58:43 +0000 (15:58 -0700)]
mon/Paxos: update first_committed only from paxos
Do not touch the in-memory first_committed until the trim commits. This
avoids any possible confusion due to races and keeps commit() as similar
to store_state() as possible.
Similarly, do not touch first_committed from store_state. We should
*only* pull it out of the kv store.
Gary Lowell [Wed, 26 Jun 2013 13:27:17 +0000 (06:27 -0700)]
doc: public network statement needed on new monitors.
When using ceph-deploy to create a new monitor on a host that is not
in the initial set of hosts defined by the ceph-deploy new command,
a "public network" statement needs to be added to the ceph.conf file.
Fixes #5195.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Tue, 25 Jun 2013 04:07:09 +0000 (21:07 -0700)]
mon/Paxos: assert that the store gives us back what we just wrote
In bug #5424 I observed leveldb failing internally and then returning
bad info. We then hit a random/confusing assert. Try to detect this
earlier by verifying that a get of a just-written last_committed gives
us back the right thing.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
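The read-back check can be sketched as follows. This is an illustration under stated assumptions, not the Paxos code: `KVStore` is a `std::map` standing in for the monitor's key/value store, and `commit_last_committed` and `demo_round_trip` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Stand-in for the monitor's key/value store (an assumption for this sketch).
typedef std::map<std::string, std::uint64_t> KVStore;

// Write last_committed, then immediately read it back and assert that the
// store returns what was just written, so corruption is caught at the source.
std::uint64_t commit_last_committed(KVStore& store, std::uint64_t version) {
    store["last_committed"] = version;
    KVStore::const_iterator it = store.find("last_committed");
    assert(it != store.end());
    assert(it->second == version);
    return it->second;
}

// Small driver that exercises the check against a fresh store.
std::uint64_t demo_round_trip(std::uint64_t version) {
    KVStore store;
    return commit_last_committed(store, version);
}
```

Failing here, right after the write, points directly at the store instead of at whichever later assert happens to trip on the bad value.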
Sage Weil [Thu, 20 Jun 2013 22:39:23 +0000 (15:39 -0700)]
mon/PaxosService: allow paxos service writes while paxos is updating
In commit f985de28f86675e974ac7842a49922a35fe24c6c I mistakenly made
is_writeable() false while paxos was updating due to a misread of
Paxos::propose_new_value() (I didn't see that it would queue).
This is problematic because it narrows the window during which each service
is writeable for no reason.
Allow service to be writeable both when paxos is active and updating.
Sage Weil [Tue, 25 Jun 2013 19:01:53 +0000 (12:01 -0700)]
mon/PGMonitor: store PGMap directly in store, bypassing PaxosService stash_full
Instead of encoding incrementals and periodically dumping the whole encoded
PGMap, store everything in a range of keys, and update them
between versions using transactions. The per-version values are now
breadcrumbs indicating which keys were dirtied so they can be refreshed
via update_from_paxos().
This has several benefits:
- we avoid ever encoding the entire PGMap
- we avoid dumping that blob into leveldb keys
- we limit the amount of data living in forward-moving keys, which leveldb
has a hard time compacting away
- pgmap data instead lives over a fixed range of keys, which leveldb
excels at
- we only keep the latest copy of the PGMap (which is all we care about)
Loic Dachary [Tue, 25 Jun 2013 14:10:02 +0000 (16:10 +0200)]
get_xattr() can return more than 4KB
Instead of failing if the attribute to be returned is larger than 4KB,
double the buffer size each time librados.rados_getxattr returns
-errno.ERANGE and try again.
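The retry loop can be sketched as below. The change itself is in the Python binding; this C++ version only illustrates the same doubling strategy. `xattr_getter`, `get_xattr_growing`, `fake_getter` and `demo_fetch_length` are hypothetical names, and the getter's contract (return the value length, or -ERANGE when the buffer is too small) is an assumption modelled on the description above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical getter: copies the value into buf and returns its length,
// -ERANGE if buf is too small, or another negative errno on failure.
typedef int (*xattr_getter)(char* buf, std::size_t buf_len);

// Start at 4KB and double the buffer each time the getter reports -ERANGE.
int get_xattr_growing(xattr_getter getter, std::string* out,
                      std::size_t initial = 4096) {
    assert(initial > 0);
    std::vector<char> buf(initial);
    for (;;) {
        int ret = getter(&buf[0], buf.size());
        if (ret >= 0) {
            out->assign(&buf[0], static_cast<std::size_t>(ret));
            return ret;
        }
        if (ret != -ERANGE)
            return ret;              // a real error: give up
        buf.resize(buf.size() * 2);  // too small: double and retry
    }
}

// Fake getter holding a 10000-byte value, larger than the initial 4KB buffer.
int fake_getter(char* buf, std::size_t buf_len) {
    const std::size_t value_len = 10000;
    if (buf_len < value_len)
        return -ERANGE;
    std::memset(buf, 'v', value_len);
    return static_cast<int>(value_len);
}

// Fetch the fake value; returns its length, or 0 if the fetch failed.
std::size_t demo_fetch_length() {
    std::string value;
    int r = get_xattr_growing(fake_getter, &value);
    return r < 0 ? 0 : value.size();
}
```

A production version would probably also cap the buffer size so that a misbehaving getter cannot force unbounded growth.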
Loic Dachary [Tue, 25 Jun 2013 13:04:34 +0000 (15:04 +0200)]
skip TEST(EXT4StoreTest, _detect_fs) if DISK or MOUNTPOINT are undefined
The TEST(EXT4StoreTest, _detect_fs) test is meant to be run from
qa/workunits/filestore/filestore.sh, after the ext4 file system was
created. If the DISK and MOUNTPOINT environment variables are not
defined, display a message explaining the expected environment and
silently skip the test. The tests in store_test.cc are not unit tests
because they depend on their environment.