Sage Weil [Tue, 9 Jul 2013 04:54:53 +0000 (21:54 -0700)]
mon: make service trim_to stateless
Call get_trim_to() when we need to know how much to trim (if any), and
calculate it then. No need to keep this in a hidden trim_version
variable and remember to update it. This drops several helpers and
accessors and makes get_trim_to() a single method that services need to
override.
Sage Weil [Tue, 9 Jul 2013 04:38:11 +0000 (21:38 -0700)]
mon/PaxosService: trim periodically instead of via propose_pending
We want to trim old states even if there is no update activity. For
example, if a long-running rebalance finishes all osdmap updates will
stop and we won't trim out old maps to free space.
Instead, trim at the same time as tick(). Remove the trim during
propose_pending() to force all trims through this path and avoid
introducing a new and rarely-exercised behavior.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Tue, 9 Jul 2013 00:46:40 +0000 (17:46 -0700)]
mon/OSDMonitor: fix base case for loading full osdmap
Right after cluster creation, first_committed is 1 and latest stashed in 0,
but we don't have the initial full map yet. Thereafter, we do (because we
write it with trim). Fixes afd6c7d8247075003e5be439ad59976c3d123218.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Mon, 8 Jul 2013 22:04:59 +0000 (15:04 -0700)]
mon: fix osdmap stash, trim to retain complete history of full maps
The current interaction between sync and stashing full osdmaps only on
active mons means that a sync can result in an incomplete osdmap_full
history:
- mon.c starts a full sync
- during sync, active osdmap service should_stash_full() is true and
includes a full in the txn
- mon.c sync finishes
- mon.c update_from_paxos gets "latest" stashed that it got from the
paxos txn
- mon.c does *not* walk to previous inc maps to complete it's collection
of full maps.
To fix this, we disable the periodic/random stash of full maps by the
osdmap service.
This introduces a new problem: we must have at least one full map (the first
one) in order for a mon that just synced to build it's full collection.
Extend the encode_trim() process to allow the osdmap service to include
the oldest full map with the trim txn. This is more complex than just
writing the full maps in the txn, but cheaper--we only write the full
map at trim time.
This *might* be related to previous bugs where the full osdmap was
missing, or case where leveldb keys seemed to 'disappear'.
unit tests for the ObjectContext methods ondisk_write_lock,
ondisk_write_unlock, ondisk_read_lock and ondisk_read_unlock.
A class derived from ::testing::Test is created with two sub-classes (
Thread_read_lock & Thread_write_lock ) to provide a separate thread
that can block with cond.Wait(). usleep(3) is used in the main thread
to wait for the expected side effect with increasing delays ( up to
MAX_DELAY ).
Sage Weil [Fri, 5 Jul 2013 23:03:49 +0000 (16:03 -0700)]
mon: remove bad assert about monmap version
It is possible to start a sync when our newest monmap is 0. Usually we see
e0 from probe, but that isn't always published as part of the very first
paxos transaction due to the way PaxosService::_active generates it's
first initial commit.
In any case, having e0 here is harmless.
Fixes: #5509 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Wed, 3 Jul 2013 23:56:06 +0000 (16:56 -0700)]
mon/Paxos: make 'paxos trim disabled max versions' much much larger
108000 is about 3 hours if paxos is going full-bore (1 proposal/second).
That ought to be pretty safe. Otherwise, we start trimming to soon and a
slow sync will just have to restart when it finishes.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Wed, 3 Jul 2013 22:36:39 +0000 (15:36 -0700)]
osd/OSDMap: handle case where some new osds have hb_front and others don't
Do not assume that because at least one OSD has an hb_front addr that they
all do, or else we will end up assigning garbage here and later thinking
it is a addr (or, more precisely, != entity_addr_t()).
Fixes: #5460 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Tue, 25 Jun 2013 20:16:45 +0000 (13:16 -0700)]
osd: fix race when queuing recovery ops
Previously we would sample how many ops to start under the lock, drop it,
and start that many. This is racy because multiple threads can jump in
and we start too many ops. Instead, claim as many slots as we can and
release them back later if we do not end up using them.
Take care to re-wake the work-queue since we are releasing more resources
for wq use.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Wed, 5 Jun 2013 05:42:52 +0000 (22:42 -0700)]
osd: do not use temp_coll for single-step pushes
If we are recovering an object in a single step, there is no need to
write it to temp and then move it. Avoiding that is a very good thing
when the FileStore has to do an fsync() for non-btrfs fs's.
Signed-off-by: Sage Weil <sage@inktank.com>
Conflicts:
src/osd/ReplicatedPG.cc
Samuel Just [Wed, 3 Jul 2013 18:18:33 +0000 (11:18 -0700)]
Elector.h: features are 64 bit
Fixes: #5497 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
Samuel Just [Wed, 3 Jul 2013 18:18:19 +0000 (11:18 -0700)]
ceph_features.h: declare all features as ULL
Otherwise, the first 32 get |'d together as ints. Then, the result
((int)-1) is sign extended to ((long long int)-1) before being |'d
with the 1LL entries. This results in ~((uint64_t)0).
Fixes: #5497 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
Samuel Just [Wed, 3 Jul 2013 04:09:36 +0000 (21:09 -0700)]
Pipe: use uint64_t not unsigned when setting features
Fixes: #5497 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
Sage Weil [Wed, 3 Jul 2013 19:20:45 +0000 (12:20 -0700)]
common: autoselect crc32c based on cpu features
If the CPu supposts SSE4.2, use the crc32c instructions. Use the magic
incantation from who knows where to do this. __builtin_cpu_supports()
is a nicer way to do it, but that is new in gcc 4.8.
Avoid static globals; they are bad. Sadly that means we redetect the CPU
feature on every call. I assume that is reasonably efficient...
Sage Weil [Tue, 2 Jul 2013 21:43:17 +0000 (14:43 -0700)]
sysvinit, upstart: handle symlinks to dirs in /var/lib/ceph/*
Match a symlink to a dir, not just dirs. This fixes the osd case of e.g.,
creating an osd in /data/osd$id in which ceph-disk makes a symlink from
/var/lib/ceph/osd/ceph-$id.
Fix proposed by Matt Thompson <matt.thompson@mandiant.com>; extended to
include the upstart users too.
Fixes: #5490 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Sage Weil [Tue, 18 Jun 2013 03:54:15 +0000 (20:54 -0700)]
ceph-disk: make is_held() smarter about full disks
Handle the case where the device is a full disk. Make the partition
check a bit more robust (don't make assumptions about naming aside from
the device being a prefix of the partition).
Sage Weil [Tue, 2 Jul 2013 20:43:29 +0000 (13:43 -0700)]
osdc/Objecter: resend command map version checks on reconnect
We already do this for Ops and LingerOps, but missed this when we added
CommandOps to the mix. The result is that an ill-timed mon disconnect will
leave a command map check (and thus the command) hanging.
set object_info_t pool of an ObjectContext if it is undefined or bad
When reading object_info_t from an existing object attribute, the pool
may be < 0 and should be set to the pool containing the object. This is done
on the oi object on the stack but overriden later by:
obc->obs.oi.decode(bv);
This decode is superfluous and is removed so that it does not override
the modified value of the pool.
Sage Weil [Tue, 2 Jul 2013 00:33:11 +0000 (17:33 -0700)]
rgw: add RGWFormatter_Plain allocation to sidestep cranky strlen()
Valgrind complains about an invalid read when we don't pad the allocation,
and because it is inlined we can't whitelist it for valgrind. Workaround
the warning by just padding our allocations a bit.
Fixes: #5346
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 29 Jun 2013 01:15:23 +0000 (18:15 -0700)]
osd: set maximum object attr size
Make a well-defined maximum size of an object attribute. Since Linus has
a 64KB limit, and that is what we normally use to back this, use that as
the limit. This means that even when leveldb is backing large xattrs
(as ext4 users must do) we will return EFBIG on >64KB setxattr attempts.
Sage Weil [Fri, 28 Jun 2013 18:50:11 +0000 (11:50 -0700)]
client: fix remaining Inode::put() caller, and make method psuedo-private
Not sure I can make this actually private and make Client::put_inode() a
friend method (making all of Client a friend would defeat the purpose).
This works well enough, though!
Sage Weil [Fri, 28 Jun 2013 04:39:35 +0000 (21:39 -0700)]
client: use put_inode on MetaRequest inode refs
When we drop the request inode refs, we need to use put_inode() to ensure
they get cleaned up properly (removed from inode_map, caps released, etc.).
Do this explicitly here (as we do with all other inode put() paths that
matter).
Fixes: #5381
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com>