git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
9 years agoos/bluestore/BlueStore: do WAL ops buffered to avoid RMW issues
Sage Weil [Tue, 22 Dec 2015 15:31:08 +0000 (10:31 -0500)]
os/bluestore/BlueStore: do WAL ops buffered to avoid RMW issues

We may have multiple WAL ops that do read/modify/write covering
the same blocks.  To avoid the complexity of identifying those
situations and ensuring that we, say, wait for writes to complete
before reading them back again, just make the IO buffered and let
the page cache handle that for us.

This fixes the failure of LibRadosAio.RoundTripWriteFull.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agorocksdb: debug log writes/reads
Sage Weil [Mon, 21 Dec 2015 21:57:12 +0000 (16:57 -0500)]
rocksdb: debug log writes/reads

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: handle both buffered and direct+async IO
Sage Weil [Mon, 21 Dec 2015 21:56:42 +0000 (16:56 -0500)]
os/bluestore: handle both buffered and direct+async IO

Prefer aio unless explicitly directed otherwise.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: rename bdev options
Sage Weil [Mon, 21 Dec 2015 21:45:02 +0000 (16:45 -0500)]
os/bluestore/BlockDevice: rename bdev options

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: use BlueFS::get_usage()
Sage Weil [Mon, 21 Dec 2015 21:18:05 +0000 (16:18 -0500)]
os/bluestore/BlueStore: use BlueFS::get_usage()

...just so we log bdev utilization in the log.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: get_usage()
Sage Weil [Mon, 21 Dec 2015 21:17:47 +0000 (16:17 -0500)]
os/bluestore/BlueFS: get_usage()

Return (and log) usage for all bdevs.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: do not dirty file when overwriting bytes
Sage Weil [Mon, 21 Dec 2015 20:33:38 +0000 (15:33 -0500)]
os/bluestore/BlueFS: do not dirty file when overwriting bytes

The rocksdb log recycle option allows us to overwrite previously
allocated space in an old log file to avoid updating the file
metadata on normal file systems.  Take advantage of that here
by implementing what is effectively O_NOCMTIME semantics: we do
not dirty the file metadata just because mtime is updated.
Instead, we dirty the file only if we allocate new space or if
the size has to be increased.

Note that on my NVMe drive, in a single-thread rados bench test, we
jump from 30MB/sec to 50MB/sec with 128KB writes as soon as we start
recycling previous logs (about 40 seconds into the run).

Signed-off-by: Sage Weil <sage@redhat.com>
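The dirtying rule described in the BlueFS commit above can be illustrated with a minimal sketch. This is not the BlueFS code: `FileMeta`, its fields, and `write` are hypothetical names chosen for illustration, and the only behavior taken from the commit message is that a file is dirtied when new space is allocated or the size grows, never for an mtime-only change.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    """Hypothetical stand-in for a file's metadata (not the BlueFS type)."""
    size: int = 0        # logical file size in bytes
    allocated: int = 0   # bytes of space already allocated to the file
    dirty: bool = False  # True when the metadata must be written out


def write(meta: FileMeta, offset: int, length: int) -> None:
    """Record a write, dirtying the metadata only when it really changes."""
    end = offset + length
    if end > meta.allocated:
        # New space has to be allocated, so the metadata changes.
        meta.allocated = end
        meta.dirty = True
    if end > meta.size:
        # The file grew, so the recorded size changes.
        meta.size = end
        meta.dirty = True
    # An overwrite inside the existing size and allocation changes neither
    # field; the mtime update alone is deliberately not a reason to dirty.
```

With a recycled log file that is already fully allocated and sized, every overwrite falls through both checks, which is the case the commit message credits for the throughput jump.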
9 years agoos/bluestore/BlueFS: ignore flush when buffer is small
Sage Weil [Mon, 21 Dec 2015 20:07:44 +0000 (15:07 -0500)]
os/bluestore/BlueFS: ignore flush when buffer is small

Rocksdb does a flush after every append, each of which is often
less than a full block.  This is very inefficient when our
_flush() will send that to disk (and block).

Avoid this most of the time by ignoring small flush requests
entirely, unless the force flag is set (e.g., by fsync).

Signed-off-by: Sage Weil <sage@redhat.com>
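A rough sketch of the flush policy described above, assuming a simple in-memory buffer. `BufferedWriter`, `min_flush_size`, and the `flushed` list are hypothetical and stand in for the real writer and device; only the rule itself (skip small flushes unless forced) comes from the commit message.

```python
class BufferedWriter:
    """Toy writer that ignores small flush requests unless forced."""

    def __init__(self, min_flush_size: int = 4096):
        self.min_flush_size = min_flush_size  # hypothetical threshold
        self.buffer = bytearray()             # appended but unflushed data
        self.flushed = []                     # chunks "sent to disk"

    def append(self, data: bytes) -> None:
        self.buffer += data

    def flush(self, force: bool = False) -> bool:
        """Return True if data was written, False if the flush was skipped."""
        if not self.buffer:
            return False
        if not force and len(self.buffer) < self.min_flush_size:
            # Small, unforced flush: keep accumulating instead of blocking.
            return False
        self.flushed.append(bytes(self.buffer))
        self.buffer.clear()
        return True

    def fsync(self) -> bool:
        # fsync must reach the device, so it sets the force flag.
        return self.flush(force=True)
```

The trade-off is that data sits in memory a little longer; durability is still provided by the forced path.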
9 years agoos/bluestore: update freelist in individual transactions
Sage Weil [Mon, 21 Dec 2015 19:45:00 +0000 (14:45 -0500)]
os/bluestore: update freelist in individual transactions

We submit each operation's transaction individually to rocksdb,
and then send a final transaction to flush them all.  However,
they may not commit atomically (all together), which means we
need to leave the individual freelist updates within each
transaction.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: better debugging on fsck alloc errors
Sage Weil [Mon, 21 Dec 2015 19:22:58 +0000 (14:22 -0500)]
os/bluestore: better debugging on fsck alloc errors

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoscript/crash_bdev: simple script to inject bdev failures
Sage Weil [Mon, 21 Dec 2015 18:54:35 +0000 (13:54 -0500)]
script/crash_bdev: simple script to inject bdev failures

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: fail mount if fsck finds errors
Sage Weil [Mon, 21 Dec 2015 18:53:34 +0000 (13:53 -0500)]
os/bluestore: fail mount if fsck finds errors

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/fs/FS.h: fix aio_t::pread
Sage Weil [Mon, 21 Dec 2015 14:49:05 +0000 (09:49 -0500)]
os/fs/FS.h: fix aio_t::pread

Allocate aligned buffer.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: better error msg for bdev label check
Sage Weil [Mon, 21 Dec 2015 14:39:56 +0000 (09:39 -0500)]
os/bluestore/BlueStore: better error msg for bdev label check

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: don't create block.{db,wal} by default
Sage Weil [Mon, 21 Dec 2015 14:00:17 +0000 (09:00 -0500)]
os/bluestore: don't create block.{db,wal} by default

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agovstart.sh: less noisy debug
Sage Weil [Mon, 21 Dec 2015 13:58:43 +0000 (08:58 -0500)]
vstart.sh: less noisy debug

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: fix fsck contains vs intersects
Sage Weil [Mon, 21 Dec 2015 13:57:18 +0000 (08:57 -0500)]
os/bluestore: fix fsck contains vs intersects

Any overlap is an error.

Signed-off-by: Sage Weil <sage@redhat.com>
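The difference between the two checks named in the commit title above can be shown on half-open byte ranges. This sketch assumes extents are `(offset, length)` pairs; the function names mirror the title but are not taken from the Ceph source.

```python
def contains(a_off: int, a_len: int, b_off: int, b_len: int) -> bool:
    """True if [b_off, b_off+b_len) lies entirely inside [a_off, a_off+a_len)."""
    return a_off <= b_off and b_off + b_len <= a_off + a_len


def intersects(a_off: int, a_len: int, b_off: int, b_len: int) -> bool:
    """True if the two half-open ranges share at least one byte."""
    return a_off < b_off + b_len and b_off < a_off + a_len
```

A partially overlapping extent passes unnoticed under a `contains` test but is caught by `intersects`, which matches the commit's statement that any overlap is an error.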
9 years agoos/bluestore: bluestore bluefs = true
Sage Weil [Sat, 19 Dec 2015 19:06:00 +0000 (14:06 -0500)]
os/bluestore: bluestore bluefs = true

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agorpm, debian: package ceph-bluefs-tool
Sage Weil [Fri, 18 Dec 2015 20:40:45 +0000 (15:40 -0500)]
rpm, debian: package ceph-bluefs-tool

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: fix error path if label set fails
Sage Weil [Thu, 17 Dec 2015 19:11:07 +0000 (14:11 -0500)]
os/bluestore/BlueStore: fix error path if label set fails

Reported-by: David Zafman <dzafman@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
9 years agorocksdb: fix recycle replay
Sage Weil [Thu, 17 Dec 2015 19:12:36 +0000 (14:12 -0500)]
rocksdb: fix recycle replay

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoMakefile-rocksdb.am: update
Sage Weil [Thu, 17 Dec 2015 14:06:48 +0000 (09:06 -0500)]
Makefile-rocksdb.am: update

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: default to 64k min_alloc_size
Sage Weil [Mon, 14 Dec 2015 21:33:38 +0000 (16:33 -0500)]
os/bluestore: default to 64k min_alloc_size

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: fix _open_bdev() failure path
Sage Weil [Mon, 14 Dec 2015 21:28:22 +0000 (16:28 -0500)]
os/bluestore/BlueStore: fix _open_bdev() failure path

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agokv/RocksDBStore: behave if options string is empty
Sage Weil [Mon, 14 Dec 2015 21:27:17 +0000 (16:27 -0500)]
kv/RocksDBStore: behave if options string is empty

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: clear coll_map on umount, fsck finish
Sage Weil [Mon, 14 Dec 2015 20:56:33 +0000 (15:56 -0500)]
os/bluestore: clear coll_map on umount, fsck finish

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/kstore/KStore: fix object key decode with key
Sage Weil [Mon, 14 Dec 2015 20:55:09 +0000 (15:55 -0500)]
os/kstore/KStore: fix object key decode with key

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: fix object key decode with key
Sage Weil [Mon, 14 Dec 2015 19:59:17 +0000 (14:59 -0500)]
os/bluestore/BlueStore: fix object key decode with key

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoceph_objectstore_test: fix warning
Sage Weil [Wed, 9 Dec 2015 21:19:58 +0000 (16:19 -0500)]
ceph_objectstore_test: fix warning

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/KeyValueStore: drop kinetic #include
Sage Weil [Wed, 9 Dec 2015 21:19:07 +0000 (16:19 -0500)]
os/KeyValueStore: drop kinetic #include

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/kstore: add new KStore backend
Sage Weil [Thu, 10 Dec 2015 21:04:32 +0000 (16:04 -0500)]
os/kstore: add new KStore backend

This is based on BlueStore, but with all of the block-related code
and complexity ripped out, and a simple striping strategy added
in its place.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/bluestore_types: localize types
Sage Weil [Thu, 10 Dec 2015 21:03:59 +0000 (16:03 -0500)]
os/bluestore/bluestore_types: localize types

Prefix with bluestore_

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: add extent_ref_map_t
Sage Weil [Thu, 10 Dec 2015 22:27:04 +0000 (17:27 -0500)]
os/bluestore: add extent_ref_map_t

This will be used to refcount extents for some subset
of the store (objects with same name or hash value?).

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/FreelistManager: drop unused db ref
Sage Weil [Thu, 10 Dec 2015 21:03:41 +0000 (16:03 -0500)]
os/bluestore/FreelistManager: drop unused db ref

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: record kv backend
Sage Weil [Thu, 10 Dec 2015 21:03:23 +0000 (16:03 -0500)]
os/bluestore: record kv backend

Record kv backend at mkfs time instead of relying on current value
of config option.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: statfs
Sage Weil [Thu, 10 Dec 2015 21:02:45 +0000 (16:02 -0500)]
os/bluestore: statfs

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: inject block failures
Sage Weil [Fri, 4 Dec 2015 01:03:10 +0000 (20:03 -0500)]
os/bluestore/BlockDevice: inject block failures

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoceph_test_objectstore: clean up synthetic collections
Sage Weil [Thu, 3 Dec 2015 21:33:37 +0000 (16:33 -0500)]
ceph_test_objectstore: clean up synthetic collections

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: block.db support
Sage Weil [Thu, 10 Dec 2015 22:17:45 +0000 (17:17 -0500)]
os/bluestore: block.db support

Support a mid- to fast device that will preferentially
store the rocksdb data (and wal, if block.wal is not
present).

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: less debug noise
Sage Weil [Thu, 10 Dec 2015 22:17:10 +0000 (17:17 -0500)]
os/bluestore: less debug noise

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: allow overwrites on open_for_write
Sage Weil [Thu, 10 Dec 2015 22:21:03 +0000 (17:21 -0500)]
os/bluestore/BlueFS: allow overwrites on open_for_write

rocksdb will occasionally overwrite an existing file
if it is not present/valid in the manifest.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: drop internal EnvMirror
Sage Weil [Wed, 25 Nov 2015 19:27:28 +0000 (14:27 -0500)]
os/bluestore/BlueStore: drop internal EnvMirror

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agorocksdb: pull up to master, include EnvMirror
Sage Weil [Fri, 11 Dec 2015 14:32:30 +0000 (09:32 -0500)]
rocksdb: pull up to master, include EnvMirror

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: label all block devices
Sage Weil [Thu, 10 Dec 2015 22:20:25 +0000 (17:20 -0500)]
os/bluestore: label all block devices

Label all of our block devices with a simple label
that includes the osd_uuid.  Wire this into the
ObjectStore and OSD probe mechanism.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: flush log if needed
Sage Weil [Thu, 10 Dec 2015 22:19:29 +0000 (17:19 -0500)]
os/bluestore/BlueFS: flush log if needed

If a file has dirty metadata (but no dirty data), we
still need to flush the log when it is flushed.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: fix replay of unlink
Sage Weil [Thu, 10 Dec 2015 22:18:57 +0000 (17:18 -0500)]
os/bluestore/BlueFS: fix replay of unlink

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: support second block.wal device
Sage Weil [Thu, 10 Dec 2015 22:15:57 +0000 (17:15 -0500)]
os/bluestore: support second block.wal device

Use this device for the bluefs log.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: fix zero gap bug
Sage Weil [Thu, 10 Dec 2015 22:15:33 +0000 (17:15 -0500)]
os/bluestore/BlueStore: fix zero gap bug

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: disable overlay for now
Sage Weil [Thu, 10 Dec 2015 22:15:14 +0000 (17:15 -0500)]
os/bluestore: disable overlay for now

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: restructure interface
Sage Weil [Fri, 27 Nov 2015 16:07:46 +0000 (11:07 -0500)]
os/bluestore/BlockDevice: restructure interface

use atomics, do not track in-flight extents or magically cope
with racing ios (that is the user's responsibility).

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: fix overwrite
Sage Weil [Thu, 10 Dec 2015 21:49:56 +0000 (16:49 -0500)]
os/bluestore/BlueFS: fix overwrite

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: fix writes spanning extents
Sage Weil [Thu, 10 Dec 2015 22:10:02 +0000 (17:10 -0500)]
os/bluestore/BlueFS: fix writes spanning extents

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: reenable rocksdb recycling
Sage Weil [Thu, 10 Dec 2015 22:09:51 +0000 (17:09 -0500)]
os/bluestore: reenable rocksdb recycling

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: lock device while open
Sage Weil [Thu, 10 Dec 2015 21:45:04 +0000 (16:45 -0500)]
os/bluestore/BlockDevice: lock device while open

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: debug read result
Sage Weil [Thu, 10 Dec 2015 21:44:42 +0000 (16:44 -0500)]
os/bluestore/BlockDevice: debug read result

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: fix alignment check
Sage Weil [Thu, 10 Dec 2015 21:44:29 +0000 (16:44 -0500)]
os/bluestore/BlockDevice: fix alignment check

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: check aio return values
Sage Weil [Thu, 10 Dec 2015 21:49:14 +0000 (16:49 -0500)]
os/bluestore/BlockDevice: check aio return values

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: avoid lock during reads
Sage Weil [Thu, 10 Dec 2015 21:43:32 +0000 (16:43 -0500)]
os/bluestore/BlueFS: avoid lock during reads

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: prevent read+write sharing
Sage Weil [Thu, 10 Dec 2015 21:43:14 +0000 (16:43 -0500)]
os/bluestore/BlueFS: prevent read+write sharing

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agovstart.sh: debug bluefs and rocksdb
Sage Weil [Thu, 10 Dec 2015 21:38:45 +0000 (16:38 -0500)]
vstart.sh: debug bluefs and rocksdb

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: periodically compact log
Sage Weil [Thu, 10 Dec 2015 21:38:35 +0000 (16:38 -0500)]
os/bluestore/BlueFS: periodically compact log

Rewrite only the current metadata in a fresh log
periodically to free log space.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: simplify extent list
Sage Weil [Thu, 10 Dec 2015 21:37:55 +0000 (16:37 -0500)]
os/bluestore/BlueFS: simplify extent list

Merge contiguous extents.

Signed-off-by: Sage Weil <sage@redhat.com>
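Merging contiguous extents, as the commit above describes, can be sketched as follows. The representation (a list of `(offset, length)` pairs kept in file order) is an assumption for illustration, not the BlueFS data structure.

```python
def merge_contiguous(extents):
    """Merge neighbours where one extent ends exactly where the next begins.

    `extents` is a list of (offset, length) pairs in file order; the order is
    preserved and the list is not sorted, because file order matters.
    """
    merged = []
    for off, length in extents:
        if merged and merged[-1][0] + merged[-1][1] == off:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, prev_len + length)
        else:
            merged.append((off, length))
    return merged
```

Two adjacent 4096-byte extents at offsets 0 and 4096 collapse into one 8192-byte extent, while an extent at 16384 stays separate.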
9 years agoos/bluestore/BlueFS: fix read
Sage Weil [Thu, 10 Dec 2015 21:37:35 +0000 (16:37 -0500)]
os/bluestore/BlueFS: fix read

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoceph_test_objectstore: trivial init fix
Sage Weil [Thu, 10 Dec 2015 21:35:37 +0000 (16:35 -0500)]
ceph_test_objectstore: trivial init fix

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agokv/RocksDBStore: rocksdb_separate_wal_dir option
Sage Weil [Thu, 10 Dec 2015 21:31:18 +0000 (16:31 -0500)]
kv/RocksDBStore: rocksdb_separate_wal_dir option

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: ref count BlueFS::File *
Sage Weil [Thu, 10 Dec 2015 21:34:27 +0000 (16:34 -0500)]
os/bluestore/BlueFS: ref count BlueFS::File *

There are FileWriters that exist when the file is
deleted.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: readdir list dirs, too
Sage Weil [Thu, 10 Dec 2015 21:32:24 +0000 (16:32 -0500)]
os/bluestore/BlueFS: readdir list dirs, too

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoceph-bluefs-tool: simple tool to export bluefs content
Sage Weil [Thu, 10 Dec 2015 21:32:06 +0000 (16:32 -0500)]
ceph-bluefs-tool: simple tool to export bluefs content

Currently we just do a dump.  We'll add more
functionality later.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: many fixes
Sage Weil [Thu, 10 Dec 2015 21:30:47 +0000 (16:30 -0500)]
os/bluestore/BlueFS: many fixes

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: share space with BlueFS
Sage Weil [Thu, 10 Dec 2015 21:16:57 +0000 (16:16 -0500)]
os/bluestore/BlueStore: share space with BlueFS

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: move to simple mutex model
Sage Weil [Thu, 10 Dec 2015 21:17:53 +0000 (16:17 -0500)]
os/bluestore/BlockDevice: move to simple mutex model

Just for now, while we get the rest of this working.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueFS: simple file system to back rocksdb
Sage Weil [Thu, 10 Dec 2015 21:07:15 +0000 (16:07 -0500)]
os/bluestore/BlueFS: simple file system to back rocksdb

BlueFS is a simple file system that will back rocksdb.
BlueRocksEnv is the rocksdb::Env implementation that
glues them together.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoceph_test_objectstore: less verbose
Sage Weil [Fri, 27 Nov 2015 16:07:36 +0000 (11:07 -0500)]
ceph_test_objectstore: less verbose

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoceph_test_objectstore: less verbose on hash collision test
Sage Weil [Wed, 25 Nov 2015 19:21:12 +0000 (14:21 -0500)]
ceph_test_objectstore: less verbose on hash collision test

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlueStore: fix _do_read
Sage Weil [Thu, 10 Dec 2015 21:22:02 +0000 (16:22 -0500)]
os/bluestore/BlueStore: fix _do_read

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/StupidAllocator: fix locking
Sage Weil [Thu, 10 Dec 2015 21:21:48 +0000 (16:21 -0500)]
os/bluestore/StupidAllocator: fix locking

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/StupidAllocator: fix misc bugs
Sage Weil [Thu, 10 Dec 2015 21:16:41 +0000 (16:16 -0500)]
os/bluestore/StupidAllocator: fix misc bugs

Can't use invalid iterator; fix init_rm_free.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/Allocator: init_rm_free
Sage Weil [Thu, 10 Dec 2015 21:08:32 +0000 (16:08 -0500)]
os/bluestore/Allocator: init_rm_free

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agokv/RocksDBStore: take custom Env
Sage Weil [Thu, 10 Dec 2015 21:10:24 +0000 (16:10 -0500)]
kv/RocksDBStore: take custom Env

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: fix _do_read return value
Sage Weil [Thu, 10 Dec 2015 21:13:27 +0000 (16:13 -0500)]
os/bluestore: fix _do_read return value

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore/BlockDevice: fix read return value
Sage Weil [Thu, 10 Dec 2015 21:13:11 +0000 (16:13 -0500)]
os/bluestore/BlockDevice: fix read return value

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/bluestore: separate Allocator from freelist storage
Sage Weil [Thu, 10 Dec 2015 21:05:56 +0000 (16:05 -0500)]
os/bluestore: separate Allocator from freelist storage

FreelistManager persists our freelist.  Allocator is a policy that
allocates from it.

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agonewstore -> bluestore
Sage Weil [Mon, 16 Nov 2015 21:42:08 +0000 (16:42 -0500)]
newstore -> bluestore

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: always create db.wal
Sage Weil [Mon, 16 Nov 2015 21:02:48 +0000 (16:02 -0500)]
os/newstore: always create db.wal

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: create db dir
Sage Weil [Mon, 16 Nov 2015 20:33:18 +0000 (15:33 -0500)]
os/newstore: create db dir

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: consume a raw block device
Sage Weil [Mon, 16 Nov 2015 18:35:37 +0000 (13:35 -0500)]
os/newstore: consume a raw block device

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: make collection_list tolerate sloppy start position
Sage Weil [Tue, 24 Nov 2015 19:12:05 +0000 (14:12 -0500)]
os/newstore: make collection_list tolerate sloppy start position

Because of this change (#6076), the hobject_t will contain the pool id, so
a ghobject_t holding this hobject_t will not be equal to ghobject_t().

In newstore, this causes an assertion failure:
FAILED assert(k >= start_key && k < end_key)

The fix is to stay compatible with that change by creating a
ghobject_t object with pool id and shard id in newstore.

Fixes: #13801
Reported-by: Zhi Zhang <zhangz.david@outlook.com>
Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: make key names more efficient
Sage Weil [Wed, 21 Oct 2015 20:45:11 +0000 (16:45 -0400)]
os/newstore: make key names more efficient

- pack u32 and u64 in binary (instead of in hex)
- avoid duplicating the object name while making things still
  sort by (key,name).  Use a prefix of < when key < name, = when
  key == name, or > when key > name.  And in the = case (which is
  basically always) include the name just once.

Note that this breaks on-disk compatibility.

Signed-off-by: Sage Weil <sage@redhat.com>
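The first bullet of the commit above (binary instead of hex) can be illustrated with a small comparison. This sketch assumes fixed-width big-endian packing, which is what keeps byte-wise ordering equal to numeric ordering; the byte order newstore actually uses is not stated in the commit message, and the helper names are hypothetical.

```python
import struct


def hex_u32(v: int) -> bytes:
    """Fixed-width hex encoding: 8 bytes per u32."""
    return b"%08x" % v


def bin_u32(v: int) -> bytes:
    """Big-endian binary encoding: 4 bytes per u32."""
    return struct.pack(">I", v)


def hex_u64(v: int) -> bytes:
    """Fixed-width hex encoding: 16 bytes per u64."""
    return b"%016x" % v


def bin_u64(v: int) -> bytes:
    """Big-endian binary encoding: 8 bytes per u64."""
    return struct.pack(">Q", v)
```

Both encodings sort the same way as the integers they represent, but the binary form is half the length, which shortens every key that embeds such a field.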
9 years agoos/newstore: fix collection_list vs max entries
Sage Weil [Fri, 16 Oct 2015 16:41:50 +0000 (12:41 -0400)]
os/newstore: fix collection_list vs max entries

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: do not set/change frag_size if there are overlays
Sage Weil [Wed, 14 Oct 2015 12:41:39 +0000 (08:41 -0400)]
os/newstore: do not set/change frag_size if there are overlays

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: define a fid_backpointer_t type
Sage Weil [Tue, 6 Oct 2015 23:05:42 +0000 (19:05 -0400)]
os/newstore: define a fid_backpointer_t type

Signed-off-by: Sage Weil <sage@redhat.com>
fix wal_oP_t

9 years agoos/newstore: add newstore types to ceph-dencoder
Sage Weil [Tue, 6 Oct 2015 23:01:17 +0000 (19:01 -0400)]
os/newstore: add newstore types to ceph-dencoder

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: set alloc hint on new frags
Sage Weil [Tue, 6 Oct 2015 01:42:09 +0000 (21:42 -0400)]
os/newstore: set alloc hint on new frags

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: dump onode contents
Sage Weil [Tue, 6 Oct 2015 12:55:27 +0000 (08:55 -0400)]
os/newstore: dump onode contents

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoos/newstore: fixed fragment size
Sage Weil [Mon, 21 Sep 2015 01:56:50 +0000 (21:56 -0400)]
os/newstore: fixed fragment size

Instead of a single, variable-length fragment for each object,
set a fixed size (newstore_min_frag_size = 1 MB) and stripe the
object over these.  The last fragment will be smaller
than 1 MB if the object is not a multiple of 1 MB.

On write, this is basically free: we can write 4 inodes created
together and fsync them just as cheaply as we can one.  On
overwrite, it allows us to replace individual fragments and avoid
write-ahead in many cases.

On read it is a bit slower because of inode lookups and disk
seeks.  In the common case (big object written sequentially) we
hope that fs prefetching will hide most of it (e.g., all inodes
will be loaded together in the same metadata btree node, and the
files' data is written sequentially on disk).

Allowing for a single large fragment in the case of a sequentially
written large object may save us something, but it complicates
the code significantly.

Signed-off-by: Sage Weil <sage@redhat.com>
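The fixed-size striping described in the commit above comes down to a little arithmetic. The sketch below assumes the 1 MB fragment size quoted in the message; `fragment_sizes` and `fragments_for` are hypothetical helpers, not newstore functions.

```python
FRAG_SIZE = 1 << 20  # 1 MB, the newstore_min_frag_size value quoted above


def fragment_sizes(object_size: int, frag_size: int = FRAG_SIZE):
    """Sizes of the fragments an object of `object_size` bytes is striped over."""
    full, rest = divmod(object_size, frag_size)
    return [frag_size] * full + ([rest] if rest else [])


def fragments_for(offset: int, length: int, frag_size: int = FRAG_SIZE):
    """Map a byte range to (fragment index, offset in fragment, chunk length)."""
    out = []
    end = offset + length
    while offset < end:
        idx = offset // frag_size
        in_frag = offset % frag_size
        chunk = min(frag_size - in_frag, end - offset)
        out.append((idx, in_frag, chunk))
        offset += chunk
    return out
```

An overwrite that covers a whole fragment can replace that fragment outright, while one that touches only part of a fragment (the first and last tuples of a range, typically) is the case that may still need write-ahead.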
9 years agoos/newstore: recycle rocksdb log files
Sage Weil [Mon, 9 Nov 2015 22:14:45 +0000 (17:14 -0500)]
os/newstore: recycle rocksdb log files

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agorocksdb: latest master
Sage Weil [Mon, 9 Nov 2015 22:13:57 +0000 (17:13 -0500)]
rocksdb: latest master

Signed-off-by: Sage Weil <sage@redhat.com>
9 years agoMerge pull request #6649 from majianpeng/filesstore-lfnunlink
Sage Weil [Fri, 1 Jan 2016 14:49:52 +0000 (09:49 -0500)]
Merge pull request #6649 from majianpeng/filesstore-lfnunlink

osd: FileStore:: optimize lfn_unlink

Reviewed-by: Kefu Chai <kchai@redhat.com>
9 years agoMerge pull request #7017 from efirs/ef_atomic_ceph_tid
Sage Weil [Fri, 1 Jan 2016 14:48:10 +0000 (09:48 -0500)]
Merge pull request #7017 from efirs/ef_atomic_ceph_tid

osd: use atomic to generate ceph_tid

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
9 years agoMerge pull request #7077 from XinzeChi/wip-fix-wip-perf
Sage Weil [Fri, 1 Jan 2016 14:47:35 +0000 (09:47 -0500)]
Merge pull request #7077 from XinzeChi/wip-fix-wip-perf

osd: fix wip (l_osd_op_wip) perf counter and remove repop_map

Reviewed-by: Kefu Chai <kchai@redhat.com>