From: Noah Watkins
Date: Tue, 12 Feb 2013 01:34:21 +0000 (-0800)
Subject: doc: document hadoop replication config
X-Git-Tag: v0.58~58^2~1
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=f923c8cd77d45b3fa0c4e827402c889a10dc7930;p=ceph.git

doc: document hadoop replication config

Signed-off-by: Noah Watkins
---

diff --git a/doc/cephfs/hadoop.rst b/doc/cephfs/hadoop.rst
index 7481b7f0d8a9..1d354af26c6a 100644
--- a/doc/cephfs/hadoop.rst
+++ b/doc/cephfs/hadoop.rst
@@ -3,7 +3,7 @@ Using Hadoop with CephFS
 ========================
 
 Hadoop Configuration
---------------------
+====================
 
 This section describes the Hadoop configuration options used to control Ceph.
 These options are intended to be set in the Hadoop configuration file
@@ -36,8 +36,102 @@ These options are intended to be set in the Hadoop configuration file
 |                     |                          |                            |
 |                     |                          |                            |
 +---------------------+--------------------------+----------------------------+
+|ceph.data.pools      |List of Ceph data pools   |Default value: default Ceph |
+|                     |for storing files.        |pool.                       |
+|                     |                          |                            |
+|                     |                          |                            |
++---------------------+--------------------------+----------------------------+
 |ceph.localize.reads  |Allow reading from file   |Default value: true         |
 |                     |replica objects           |                            |
 |                     |                          |                            |
 |                     |                          |                            |
 +---------------------+--------------------------+----------------------------+
+
+Support For Per-file Custom Replication
+---------------------------------------
+
+Hadoop users may specify a custom replication factor (e.g. 3 copies of each
+block) when creating a file. However, object replication factors are
+controlled on a per-pool basis in Ceph, and by default a Ceph file system will
+contain a pre-configured pool. In order to support per-file replication Hadoop
+can be configured to select from alternative pools when creating new files.
+
+Additional data pools can be specified using the ``ceph.data.pools``
+configuration option. The value of the option is a comma-separated list of
+pool names.
+The default Ceph pool will be used automatically if this
+configuration option is omitted or the value is empty. For example, the
+following configuration setting will consider the three pools listed. ::
+
+    <property>
+      <name>ceph.data.pools</name>
+      <value>pool1,pool2,pool5</value>
+    </property>
+
+Hadoop will not create pools automatically. In order to create a new pool with
+a specific replication factor use the ``ceph osd pool create`` command, and then
+set the ``size`` property on the pool using the ``ceph osd pool set`` command. For
+more information on creating and configuring pools see the `RADOS Pool
+documentation`_.
+
+.. _RADOS Pool documentation: ../../rados/operations/pools
+
+Once a pool has been created and configured the metadata service must be told
+that the new pool may be used to store file data. A pool can be made available
+for storing file system data using the ``ceph mds add_data_pool`` command.
+
+First, create the pool. In this example we create the ``hadoop1`` pool with
+replication factor 1. ::
+
+    ceph osd pool create hadoop1
+    ceph osd pool set hadoop1 size 1
+
+Next, determine the pool id. This can be done using the ``ceph osd dump``
+command. For example, we can look for the newly created ``hadoop1`` pool. ::
+
+    ceph osd dump | grep hadoop1
+
+The output should resemble::
+
+    pool 3 'hadoop1' rep size 1 min_size 1 crush_ruleset 0...
+
+where ``3`` is the pool id. Next we will use the pool id reference to register
+the pool as a data pool for storing file system data. ::
+
+    ceph mds add_data_pool 3
+
+The final step is to configure Hadoop to consider this data pool when
+selecting the target pool for new files. ::
+
+    <property>
+      <name>ceph.data.pools</name>
+      <value>hadoop1</value>
+    </property>
+
+Pool Selection Semantics
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following semantics describe the rules by which Hadoop will choose a pool
+given a desired replication factor and the set of pools specified using the
+``ceph.data.pools`` configuration option.
+
+1. When no custom pools are specified the default Ceph data pool is used.
+2.
+   A custom pool with the same replication factor as the default Ceph data
+   pool will override the default.
+3. A pool with a replication factor that matches the desired replication will
+   be chosen if it exists.
+4. Otherwise, a pool with at least the desired replication factor will be
+   chosen, or the maximum possible.
+
+Debugging Pool Selection
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Hadoop will produce a log file entry when it cannot determine the replication
+factor of a pool (e.g. it is not configured as a data pool). The log message
+will appear as follows::
+
+    Error looking up replication of pool:
+
+Hadoop will also produce a log entry when it wasn't able to select an exact
+match for replication. This log entry will appear as follows::
+
+    selectDataPool path= pool:repl=: wanted=
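
The four rules listed under Pool Selection Semantics in this patch can be read as a small decision procedure. The following is a minimal sketch of one possible reading, not the Hadoop bindings' actual code. The function name ``select_data_pool`` and its parameters are hypothetical. It also assumes that the default pool stays a candidate when custom pools are configured, that the first listed custom pool wins among pools with equal replication, and that "at least the desired replication factor" means the smallest such factor.

```python
from typing import Dict


def select_data_pool(desired: int,
                     default_pool: str,
                     default_repl: int,
                     custom_pools: Dict[str, int]) -> str:
    """Pick a pool name for a file that wants `desired` replicas.

    `custom_pools` maps pool name -> replication factor, in the order the
    pools appear in ceph.data.pools (hypothetical representation).
    """
    # Rule 1: no custom pools configured, so the default pool is used.
    if not custom_pools:
        return default_pool

    # Index candidates by replication factor, starting with the default pool.
    by_repl = {default_repl: default_pool}
    for name, repl in custom_pools.items():
        # Rule 2: a custom pool with the default's replication overrides it.
        # Assumption: among custom pools with equal replication, first wins.
        if repl not in by_repl or by_repl[repl] == default_pool:
            by_repl[repl] = name

    # Rule 3: an exact match on replication factor is preferred.
    if desired in by_repl:
        return by_repl[desired]

    # Rule 4: otherwise take a pool with at least the desired replication
    # (assumption: the smallest such factor), else the maximum available.
    larger = [repl for repl in by_repl if repl > desired]
    chosen = min(larger) if larger else max(by_repl)
    return by_repl[chosen]
```

For instance, with a default pool of replication 2 and ``hadoop1`` configured with replication 1, a request for a single replica selects ``hadoop1`` under this reading, while a request for 2 replicas stays on the default pool. The cases that fall through to rule 4 are the ones that would produce the ``selectDataPool`` log entry described above.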