Last updated: 2017-04-08
The FreeBSD build will build most of the tools in Ceph.
-Note that the (kernel) RBD dependant items will not work
+Note that the (kernel) RBD dependent items will not work
I started looking into Ceph, because the HAST solution with CARP and
ggate did not really do what I was looking for. But I'm aiming for
11-RELEASE will also work. And Clang is at 3.8.0.
It uses the CLANG toolset that is available, 3.7 is no longer tested,
but was working when that was with 11-CURRENT.
- Clang 3.4 (on 10.2-STABLE) does not have all required capabilites to
+ Clang 3.4 (on 10.2-STABLE) does not have all required capabilities to
compile everything
The following setup will get things running for FreeBSD:
with all the packages FreeBSD already has in place. Lots of minute
details to figure out
- - Design a vitual disk implementation that can be used with behyve and
+ - Design a virtual disk implementation that can be used with bhyve and
attached to an RBD image.
- **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides
a POSIX compliant filesystem usable with ``mount`` or as
- a filesytem in user space (FUSE).
+ a filesystem in user space (FUSE).
Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
and high availability. The following diagram depicts the high-level
The memory tracking used is currently imprecise by a constant factor. This
will be addressed in http://tracker.ceph.com/issues/22599. MDS deployments
with large `mds_cache_memory_limit` (64GB+) should underallocate RAM to
- accomodate.
+ accommodate.
See `User Management - Add a User to a Keyring`_. for additional details on user management
-To restrict a client to the specfied sub-directory only, we mention the specified
+To restrict a client to the specified sub-directory only, we mention the specified
directory while mounting using the following syntax. ::
./ceph-fuse -n client.*client_name* *mount_path* -r *directory_to_be_mounted*
``mds bal fragment interval``
-:Description: The delay (in seconds) between a fragment being elegible for split
+:Description: The delay (in seconds) between a fragment being eligible for split
or merge and executing the fragmentation change.
:Type: 32-bit Integer
:Default: ``5``
A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
-children. However, the parents pin can be overriden by setting the child
+children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:
::
differences. For this reason, it's necessary during any cluster upgrade to
reduce the number of active MDS for a file system to one first so that two
active MDS do not communicate with different versions. Further, it's also
-necessary to take standbys offline as any new CompatSet flags will propogate
+necessary to take standbys offline as any new CompatSet flags will propagate
via the MDSMap to all MDS and cause older MDS to suicide.
The proper sequence for upgrading the MDS cluster is:
./ceph status
-Now put something in usin rados, check that it made it, get it back, and remove it.::
+Now put something in using rados, check that it made it, get it back, and remove it.::
./ceph osd pool create test-blkin 8
./rados put test-object-1 ./vstart.sh --pool=test-blkin
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
-combine `snapids` associated with the SnapRealm and all vaild `snapids` in
+combine `snapids` associated with the SnapRealm and all valid `snapids` in
`past_parent_snaps`. Stale `snapids` are filtered out by SnapClient's cached
effective snapshots.
There is a default value for every config option. In some cases, there may
also be a *daemon default* that only applies to code that declares itself
-as a daemon (in thise case, the regular default only applies to non-daemons).
+as a daemon (in this case, the regular default only applies to non-daemons).
Safety
------
*OpenFileTable*
Open file table tracks open files and their ancestor directories. Recovering
- MDS can easily get open files' pathes, significantly reducing the time of
+ MDS can easily get open files' paths, significantly reducing the time of
loading inodes for open files. Each entry in the table corresponds to an inode,
it records linkage information (parent inode and dentry name) of the inode. MDS
can constructs the inode's path by recursively lookup parent inode's linkage.
issue. The data of the temporary object wants to be located as close
to the data of the base object as possible. This may be best performed
by adding a new ObjectStore creation primitive that takes the base
-object as an addtional parameter that is a hint to the allocator.
+object as an additional parameter that is a hint to the allocator.
Sam: I think that the short lived thing may be a red herring. We'll
be updating the donor and primary objects atomically, so it seems like
might be desirable to rotate the shards based on object hash). Even
if you chose to designate a shard as witnessing all writes, the pg
might be degraded with that particular shard missing. This is a bit
-tricky, currently reads and writes implicitely return the most recent
+tricky, currently reads and writes implicitly return the most recent
version of the object written. On reads, we'd have to read K shards
to answer that question. We can get around that by adding a "don't
tell me the current version" flag. Writes are more problematic: we
indices iirc, and those will always be on replicated because they use
omap).
-We can avoid (1) by maintaining the missing set explicitely. It's
+We can avoid (1) by maintaining the missing set explicitly. It's
already possible for there to be a missing object without a
corresponding log entry (Consider the case where the most recent write
is to an object which has not been updated in weeks. If that write
would still let us implement and partially test the augmented backfill
code as well as the extra pg log entry fields -- this depends on the
explicit pg log entry branch having already merged. It's not entirely
-clear to me that this one is worth doing seperately. It's enough code
+clear to me that this one is worth doing separately. It's enough code
that I'd really prefer to get it done independently, but it's also a
fair amount of scaffolding that will be later discarded.
info.last_epoch_started >= MAX(history.last_epoch_started) must be an
upper bound on writes reported as committed to the client.
-We update info.last_epoch_started with the intial activation message,
+We update info.last_epoch_started with the initial activation message,
but we only update history.last_epoch_started after the new
info.last_epoch_started is persisted (possibly along with the first
write). This ensures that we do not require an osd with the most
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
-since map distribution itself is asyncronous -- see Peering and the
+since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-pg version number
-- this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of "recent"
bound the amount of outstanding IO we need to do to flush the journal.
At the same time, we don't want to necessarily do it inline in case we
might be able to combine several IOs on the same object close together
-in time. Thus, in FileStore::_write, we queue the fd for asyncronous
+in time. Thus, in FileStore::_write, we queue the fd for asynchronous
flushing and block in FileStore::_do_op if we have exceeded any hard
limits until the background flusher catches up.
#. hobject_t end
There are two types of backoff: a *PG* backoff will plug all requests
-targetting an entire PG at the client, as described by a range of the
+targeting an entire PG at the client, as described by a range of the
hash/hobject_t space [begin,end), while an *object* backoff will plug
-all requests targetting a single object (begin == end).
+all requests targeting a single object (begin == end).
When the client receives a *block* backoff message, it is now
responsible for *not* sending any requests for hobject_ts described by
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
| GET | Bucket requestPayment | No | | |
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
-| GET | Bucket versionning | No | | |
+| GET | Bucket versioning | No | | |
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
| GET | Bucket website | No | | |
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
| PUT | Bucket requestPayment | No | | |
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
-| PUT | Bucket versionning | No | | |
+| PUT | Bucket versioning | No | | |
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
| PUT | Bucket website | No | | |
+--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
``fsid`` term is used interchangeably with ``uuid``
OSD uuid
- Just like the OSD fsid, this is the OSD unique identifer and is used
+ Just like the OSD fsid, this is the OSD unique identifier and is used
interchangeably with ``fsid``
bluestore
dashboard functionality can be accessed by the user.
The Dashboard functionality/modules are grouped within a *security scope*.
-Security scopes are predefined and static. The current avaliable security
+Security scopes are predefined and static. The current available security
scopes are:
- **hosts**: includes all features related to the ``Hosts`` menu
Debugging
---------
-By default, a few debugging statments as well as error statements have been set to print in the log files. Users can add more if necessary.
+By default, a few debugging statements as well as error statements have been set to print in the log files. Users can add more if necessary.
To make use of the debugging option in the module:
- Add this to the ceph.conf file.::
includes external projects such as ceph-ansible, DeepSea, and Rook.
An *orchestrator module* is a ceph-mgr module (:ref:`mgr-module-dev`)
-which implements common managment operations using a particular
+which implements common management operations using a particular
orchestrator.
Orchestrator modules subclass the ``Orchestrator`` class: this class is
:Description: Largest number of PGs per "involved" OSD to let split create.
When we increase the ``pg_num`` of a pool, the placement groups
- will be splitted on all OSDs serving that pool. We want to avoid
+ will be split on all OSDs serving that pool. We want to avoid
extreme multipliers on PG splits.
:Type: Integer
:Default: 300
When you create pools and set the number of placement groups for the pool, Ceph
uses default values when you don't specifically override the defaults. **We
-recommend** overridding some of the defaults. Specifically, we recommend setting
+recommend** overriding some of the defaults. Specifically, we recommend setting
a pool's replica size and overriding the default number of placement groups. You
can specifically set these values when running `pool`_ commands. You can also
override the defaults by adding new ones in the ``[global]`` section of your
``packetsize={bytes}``
:Description: The encoding will be done on packets of *bytes* size at
- a time. Chosing the right packet size is difficult. The
+ a time. Choosing the right packet size is difficult. The
*jerasure* documentation contains extensive information
on this topic.
The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
-restriction and it may be more convienient to use it for specific
+restriction and it may be more convenient to use it for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
step 3 ____cDDD
are applied in order. For instance, if a 4K object is encoded, it will
-first go thru *step 1* and be divided in four 1K chunks (the four
+first go through *step 1* and be divided in four 1K chunks (the four
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in the chunks 1 and 5, respectively.
While Ceph uses heartbeats to ensure that hosts and daemons are running, the
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
reporting statistics in a timely manner (e.g., a temporary network fault). By
-default, OSD daemons report their placement group, up thru, boot and failure
+default, OSD daemons report their placement group, up through, boot and failure
statistics every half second (i.e., ``0.5``), which is more frequent than the
heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
fails to report to the monitor or if other OSDs have reported the primary OSD
POST /{bucket}?mdsearch
x-amz-meta-search: <key [; type]> [, ...]
-Multiple metadata fields must be comma seperated, a type can be forced for a
+Multiple metadata fields must be comma-separated; a type can be forced for a
field with a `;`. The currently allowed types are string(default), integer and
date
* The sysvinit script now uses the ceph.conf file on the remote host
when starting remote daemons via the '-a' option. Note that if '-a'
- is used in conjuction with '-c path', the path must also be present
+ is used in conjunction with '-c path', the path must also be present
on the remote host (it is not copied to a temporary file, as it was
previously).
* mds: many fixes (Yan Zheng)
* mds: misc bug fixes with clustered MDSs and failure recovery
* mds: misc bug fixes with readdir
-* mds: new encoding for all data types (to allow forward/backward compatbility) (Greg Farnum)
+* mds: new encoding for all data types (to allow forward/backward compatibility) (Greg Farnum)
* mds: store and update backpointers/traces on directory, file objects (Sam Lang)
* mon: 'osd crush add|link|unlink|add-bucket ...' commands
* mon: ability to tune leveldb
* radosgw: fix object copy onto self (Yehuda Sadeh)
* radosgw: ACL grants in headers (Caleb Miles)
* radosgw: ability to listen to fastcgi via a port (Guilhem Lettron)
- * mds: new encoding for all data types (to allow forward/backward compatbility) (Greg Farnum)
+ * mds: new encoding for all data types (to allow forward/backward compatibility) (Greg Farnum)
* mds: fast failover between MDSs (enforce unique mds names)
* crush: ability to create, remove rules via CLI
* many many cleanups (Danny Al-Gaaf)
Manager to run. For high availability, Ceph Storage Clusters typically
run multiple Ceph Monitors so that the failure of a single Ceph
Monitor will not bring down the Ceph Storage Cluster. Ceph uses the
-Paxos algorithm, which requires a majority of monitors (i.e., greather
+Paxos algorithm, which requires a majority of monitors (i.e., greater
than *N/2* where *N* is the number of monitors) to form a quorum.
Odd numbers of monitors tend to be better, although this is not required.
both inodes and directories.
Auth pins can only exist for authoritative metadata, because they are
-only created if the object is authoritative, and their presense
+only created if the object is authoritative, and their presence
prevents the migration of authority.
Cache expiration messages that are received for a subtree that is
being exported are either deferred or handled immediately, based on
-the sender and reciever states. The importing MDS will always defer until
+the sender and receiver states. The importing MDS will always defer until
after the export finishes, because the import could fail. The exporting MDS
processes the expire UNLESS the expiring MDS does not know about the export or
the exporting MDS is no longer auth.
mds_kill_export_at:
1: After moving to STATE_EXPORTING
2: After sending MExportDirDiscover
-3: After recieving MExportDirDiscoverAck and auth_unpin'ing.
+3: After receiving MExportDirDiscoverAck and auth_unpin'ing.
4: After sending MExportDirPrep
5: After receiving MExportDirPrepAck
6: After sending out MExportDirNotify to all replicas
- locked: prevents dn read
- on auth
--> grab _all_ path pins at onces; hold none while waiting.
+-> grab _all_ path pins at once; hold none while waiting.
-> grab xlocks in order.
--- auth_pin = pin to authority, on *dir, *in
-> blocking on auth_pins is dangerous. _never_ block if we are holding other auth_pins on the same node (subtree?).
-> grab _all_ auth pins at once; hold none while waiting.
---- hard/file_wrlock = exlusive lock on inode content
+--- hard/file_wrlock = exclusive lock on inode content
- prevents inode read
- on auth
- order inodes on (ino);
- need to order both read and write locks, esp with dentries. so, if we need to lock /usr/bin/foo with read on usr and bin and xwrite on foo, we need to acquire all of those locks using the same ordering.
- on same host, we can be 'nice' and check lockability of all items, then lock all, and drop everything while waiting. (actually, is there any use to this?)
- - on mutiple hosts, we need to use full ordering (at least as things separate across host boundaries). and if needed lock set changes (such that the order of already acquired locks changes), we need to drop those locks and start over.
+ - on multiple hosts, we need to use full ordering (at least as things separate across host boundaries). and if needed lock set changes (such that the order of already acquired locks changes), we need to drop those locks and start over.
- how do auth pins fit into all this?
- auth pin on xlocks only. no need on read locks.
stored in the ``dist/`` directory. Use the ``-prod`` flag for a
production build. Navigate to ``https://localhost:8443``.
-Formating TS and SCSS files
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Formatting TS and SCSS files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use `Prettier <https://prettier.io/>`_ to automatically format TS and SCSS
files.
For ``GET`` and ``DELETE`` methods, the method's non-optional parameters are
considered path parameters by default. Optional parameters are considered
-query parameters. By specifing the ``query_parameters`` in the endpoint
+query parameters. By specifying the ``query_parameters`` in the endpoint
decorator it is possible to make a non-optional parameter to be a query
parameter.
Defining path parameters in endpoints's URLs using python methods's parameters
is very easy but it is still a bit strict with respect to the position of these
parameters in the URL structure.
-Sometimes we may want to explictly define a URL scheme that
+Sometimes we may want to explicitly define a URL scheme that
contains path parameters mixed with static parts of the URL.
Our controller infrastructure also supports the declaration of URL paths with
explicit path parameters at both the controller level and method level.
setting, and the python type of the value.
By declaring the ``ADMIN_EMAIL_ADDRESS`` class attribute, when you restart the
-dashboard plugin, you will atomatically gain two additional CLI commands to
+dashboard plugin, you will automatically gain two additional CLI commands to
get and set that setting::
$ ceph dashboard get-admin-email-address
The ``TaskExecutor`` class is responsible for code that executes a given task
-function, and defines three methods that can be overriden by
+function, and defines three methods that can be overridden by
subclasses::
def init(self, task)