mantle: store balancer in RADOS, balancer version in MDSMap

author Michael Sevilla <mikesevilla3@gmail.com>

Tue, 2 Feb 2016 19:25:42 +0000 (11:25 -0800)

committer Michael Sevilla <mikesevilla3@gmail.com>

Tue, 25 Oct 2016 20:27:34 +0000 (13:27 -0700)
diff --git a/doc/cephfs/experimental-features.rst b/doc/cephfs/experimental-features.rst

index 4b9049592d2a52538ff8e7eeb49af9e036f431a5..1f6e3c2af41cb9952d75fb2735a640c21ad38b5c 100644 (file)
--- a/doc/cephfs/experimental-features.rst
+++ b/doc/cephfs/experimental-features.rst
@@ -52,6 +52,14 @@ There are serious known bugs.
  Multi-MDS filesystems have always required explicitly increasing the "max_mds"
  value and have been further protected with the "allow_multimds" flag for Jewel.
  
+Mantle: Programmable Metadata Load Balancer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mantle is a programmable metadata balancer built into the MDS. The idea is to
+protect the mechanisms for balancing load (migration, replication,
+fragmentation) but stub out the balancing policies using Lua. For details, see
+:doc:`/cephfs/mantle`.
+
  Snapshots
  ---------
  Like multiple active MDSes, CephFS is designed from the ground up to support
diff --git a/doc/cephfs/mantle.rst b/doc/cephfs/mantle.rst

new file mode 100644 (file)

index 0000000..9d0b6bd
--- /dev/null
+++ b/doc/cephfs/mantle.rst
@@ -0,0 +1,235 @@
+Mantle
+======
+
+Multiple, active MDSs can migrate directories to balance metadata load. The
+policies for when, where, and how much to migrate are hard-coded into the
+metadata balancing module. Mantle is a programmable metadata balancer built
+into the MDS. The idea is to protect the mechanisms for balancing load
+(migration, replication, fragmentation) but stub out the balancing policies
+using Lua. Mantle is based on [1] but the current implementation does *NOT*
+have the following features from that paper:
+
+1. Balancing API: in the paper, the user fills in when, where, how much, and
+   load calculation policies; currently, Mantle only requires that Lua policies
+   return a table of target loads (e.g., how much load to send to each MDS)
+2. "How much" hook: in the paper, there was a hook that let the user control
+   the fragment selector policy; currently, Mantle does not have this hook
+3. Instantaneous CPU utilization as a metric
+
+[1] Supercomputing '15 Paper:
+http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html
+
+Quickstart with vstart
+----------------------
+
+.. warning::
+
+    Developing balancers with vstart is difficult because running all daemons
+    and clients on one node can overload the system. Let it run for a while, even
+    though you will likely see a bunch of lost heartbeat and laggy MDS warnings.
+    Most of the time this guide will work but sometimes all MDSs lock up and you
+    cannot actually see them spill. It is much better to run this on a cluster.
+
+As a prerequisite, we assume you've installed `mdtest
+<https://sourceforge.net/projects/mdtest/>`_ or pulled the `Docker image
+<https://hub.docker.com/r/michaelsevilla/mdtest/>`_. We use mdtest because we
+need to generate enough load to get over the MIN_OFFLOAD threshold that is
+arbitrarily set in the balancer. For example, this does not create enough
+metadata load:
+
+::
+
+    while true; do
+      touch "/cephfs/blah-`date`"
+    done
+
+
+Mantle with `vstart.sh`
+~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Start Ceph and tune the logging so we can see migrations happen:
+
+::
+
+    ./vstart.sh -n -l
+    for i in a b c; do 
+      ./ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
+      ./ceph --admin-daemon out/mds.$i.asok config set debug_mds 0
+      ./ceph --admin-daemon out/mds.$i.asok config set debug_mds_balancer 2
+      ./ceph --admin-daemon out/mds.$i.asok config set mds_beacon_grace 1500
+    done
+
+
+2. Put the balancer into RADOS:
+
+::
+
+    ./rados put --pool=cephfs_metadata_a greedyspill.lua mds/balancers/greedyspill.lua
+
+
+3. Activate Mantle:
+
+::
+
+    ./ceph mds set allow_multimds true --yes-i-really-mean-it
+    ./ceph mds set max_mds 5
+    ./ceph mds set balancer greedyspill.lua
+
+
+4. Mount CephFS in another window:
+
+::
+
+     ./ceph-fuse /cephfs -o allow_other &
+     tail -f out/mds.a.log
+
+
+   Note that if you look at the last MDS (which could be a, b, or c -- it's
+   random), you will see an attempt to index a nil value. This is because the
+   last MDS tries to check the load of its neighbor, which does not exist.
+
+5. Run a simple benchmark. In our case, we use the Docker mdtest image to
+   create load. Assuming that CephFS is mounted in the first container, we can
+   share the mount and run 3 clients using: 
+
+::
+
+    for i in 0 1 2; do
+      docker run -d \
+        --name=client$i \
+        -v /cephfs:/cephfs \
+        michaelsevilla/mdtest \
+        -F -C -n 100000 -d "/cephfs/client-test$i"
+    done
+
+
+6. When you're done, you can kill all the clients with:
+
+::
+
+    for i in 0 1 2; do docker rm -f client$i; done
+
+
+Output
+~~~~~~
+
+Looking at the log for the first MDS (could be a, b, or c), we see that
+no MDS has any load yet:
+
+::
+
+    2016-08-21 06:44:01.763930 7fd03aaf7700  0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+    2016-08-21 06:44:01.763966 7fd03aaf7700  0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+    2016-08-21 06:44:01.763982 7fd03aaf7700  0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+    2016-08-21 06:44:01.764010 7fd03aaf7700  2 lua.balancer when: not migrating! my_load=0.0 hisload=0.0
+    2016-08-21 06:44:01.764033 7fd03aaf7700  2 mds.0.bal  mantle decided that new targets={}
+
+
+After the job starts, MDS0 gets about 1953 units of load. The greedy spill
+balancer dictates that half the load goes to your neighbor MDS, so we see that
+Mantle tries to send about 977 load units (half of 1953) to MDS1.
+
+::
+
+    2016-08-21 06:45:21.869994 7fd03aaf7700  0 lua.balancer MDS0: < auth.meta_load=5834.188908912 all.meta_load=1953.3492228857 req_rate=12591.0 queue_len=1075.0 cpu_load_avg=3.05 > load=1953.3492228857
+    2016-08-21 06:45:21.870017 7fd03aaf7700  0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
+    2016-08-21 06:45:21.870027 7fd03aaf7700  0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
+    2016-08-21 06:45:21.870034 7fd03aaf7700  2 lua.balancer when: migrating! my_load=1953.3492228857 hisload=0.0
+    2016-08-21 06:45:21.870050 7fd03aaf7700  2 mds.0.bal  mantle decided that new targets={0=0,1=976.675,2=0}
+    2016-08-21 06:45:21.870094 7fd03aaf7700  0 mds.0.bal    - exporting [0,0.52287 1.04574] 1030.88 to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
+    2016-08-21 06:45:21.870151 7fd03aaf7700  0 mds.0.migrator nicely exporting to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
+
+
+Eventually load moves around:
+
+::
+
+    2016-08-21 06:47:10.210253 7fd03aaf7700  0 lua.balancer MDS0: < auth.meta_load=415.77414300449 all.meta_load=415.79000078186 req_rate=82813.0 queue_len=0.0 cpu_load_avg=11.97 > load=415.79000078186
+    2016-08-21 06:47:10.210277 7fd03aaf7700  0 lua.balancer MDS1: < auth.meta_load=228.72023977691 all.meta_load=186.5606496623 req_rate=28580.0 queue_len=0.0 cpu_load_avg=11.97 > load=186.5606496623
+    2016-08-21 06:47:10.210290 7fd03aaf7700  0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=11.97 > load=0.0
+    2016-08-21 06:47:10.210298 7fd03aaf7700  2 lua.balancer when: not migrating! my_load=415.79000078186 hisload=186.5606496623
+    2016-08-21 06:47:10.210311 7fd03aaf7700  2 mds.0.bal  mantle decided that new targets={}
+
+
+Implementation Details
+----------------------
+
+Most of the implementation is in MDBalancer. Metrics are passed to the balancer
+policies via the Lua stack and a list of loads is returned to MDBalancer.
+Mantle sits alongside the current balancer implementation and is enabled with
+a Ceph CLI command ("ceph mds set balancer mybalancer.lua"). If the Lua policy
+fails (for whatever reason), we fall back to the original metadata load
+balancer. The balancer is stored in the RADOS metadata pool and a string in the
+MDSMap tells the MDSs which balancer to use.
+
+Exposing Metrics to Lua
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Metrics are exposed directly to the Lua code as global variables instead of
+using a well-defined function signature. There is a global "mds" table, where
+each index is an MDS number (e.g., 0) and each value is a dictionary of metrics
+and values. The Lua code can grab metrics using something like this:
+
+::
+
+    mds[0]["queue_len"]
+
+
+This is in contrast to cls_lua in the OSDs, which has well-defined arguments
+(e.g., input/output bufferlists). Exposing the metrics directly makes it easier
+to add new metrics without having to change the API on the Lua side; we want
+the API to grow and shrink as we explore which metrics matter. The downside of
+this approach is that the person programming Lua balancer policies has to look
+at the Ceph source code to see which metrics are exposed. We figure that the
+Mantle developer will be in touch with MDS internals anyway.
+
+The metrics exposed to the Lua policy are the same ones that are already stored
+in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_len,
+cpu_load_avg.
+
+Compile/Execute the Balancer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here we use `lua_pcall` instead of `lua_call` because we want to handle errors
+in the MDBalancer. We do not want the error propagating up the call chain. The
+cls_lua class wants to handle the error itself because it must fail gracefully.
+For Mantle, we don't care if a Lua error crashes our balancer -- in that case,
+we'll fall back to the original balancer.
+
+The performance improvement of using `lua_call` over `lua_pcall` would not be
+leveraged here because the balancer is invoked every 10 seconds by default. 
+
+Returning Policy Decision to C++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We force the Lua policy engine to return a table of values, corresponding to
+the amount of load to send to each MDS. These loads are inserted directly into
+the MDBalancer "my_targets" vector. We do not allow the MDS to return a table
+of MDSs and metrics because we want the decision to be completely made on the
+Lua side.
+
+Iterating through tables returned by Lua is done through the stack. In Lua
+jargon: a dummy value is pushed onto the stack and the next iterator replaces
+the top of the stack with a (k, v) pair. After reading each value, pop that
+value but keep the key for the next call to `lua_next`. 
+
+Debugging
+~~~~~~~~~
+
+Logging in a Lua policy will appear in the MDS log. The syntax is the same as
+the cls logging interface:
+
+::
+
+    BAL_LOG(0, "this is a log message")
+
+
+It is implemented by passing a function that wraps the `dout` logging framework
+(`dout_wrapper`) to Lua with the `lua_register()` primitive. The Lua code is
+actually calling the `dout` function in C++.
+
+Testing
+~~~~~~~
+
+Testing is done with the ceph-qa-suite (tasks.cephfs.test_mantle). We do not
+test invalid balancer logging or loading the actual Lua VM.
diff --git a/src/mds/MDBalancer.cc b/src/mds/MDBalancer.cc

index 33bbbe91f94064c485425029f0de3b3ca1c6f199..df66226f79d64a034af36cc76ecd5231389f3d4d 100644 (file)
--- a/src/mds/MDBalancer.cc
+++ b/src/mds/MDBalancer.cc
@@ -37,6 +37,7 @@ using std::map;
  using std::vector;
  
  #include "common/config.h"
+#include "common/errno.h"
  
  #define dout_subsys ceph_subsys_mds
  #undef DOUT_COND
@@ -170,6 +171,28 @@ mds_load_t MDBalancer::get_load(utime_t now)
    return load;
  }
  
+int MDBalancer::localize_balancer(string const balancer)
+{
+  int64_t pool_id = mds->mdsmap->get_metadata_pool();
+  string fname = "/tmp/" + balancer;
+
+  dout(15) << "looking for balancer=" << balancer << " in RADOS pool_id=" << pool_id << dendl;
+  object_t oid = object_t(balancer);
+  object_locator_t oloc(pool_id);
+  bufferlist data;
+  C_SaferCond waiter;
+  mds->objecter->read(oid, oloc, 0, 0, CEPH_NOSNAP, &data, 0, &waiter);
+  int r = waiter.wait();
+  if (r == 0) {
+    dout(15) << "write data from RADOS into fname=" << fname << " data=" << data.c_str() << dendl;
+    data.write_file(fname.c_str());
+  } else {
+    dout(0) << "tick could not find balancer " << balancer
+            << " in RADOS: " << cpp_strerror(r) << dendl;
+  }
+  return r;
+}
+
  void MDBalancer::send_heartbeat()
  {
    utime_t now = ceph_clock_now(g_ceph_context);
@@ -606,8 +629,13 @@ void MDBalancer::prep_rebalance(int beat)
  
  int MDBalancer::mantle_prep_rebalance()
  {
-  /* hard-code lua balancer */
-  string script = "BAL_LOG(0, \"I am mds \"..whoami)\n return {11, 12, 3}";
+  /* pull metadata balancer from RADOS */
+  string const balancer = mds->mdsmap->get_balancer();
+  if (balancer == "" || localize_balancer(balancer))
+    return -ENOENT;
+  ifstream f("/tmp/" + balancer);
+  string script((istreambuf_iterator<char>(f)),
+                 istreambuf_iterator<char>());
  
    /* prepare for balancing */
    int cluster_size = mds->get_mds_map()->get_num_in_mds();
diff --git a/src/mds/MDSMap.cc b/src/mds/MDSMap.cc

index eec9d468762908e41a751f0ac62137e17a67a748..a0d7336e2b691b9fd6c381eaa94dc30ec7897619 100644 (file)
--- a/src/mds/MDSMap.cc
+++ b/src/mds/MDSMap.cc
@@ -487,6 +487,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
      ::encode(session_autoclose, bl);
      ::encode(max_file_size, bl);
      ::encode(max_mds, bl);
+    ::encode(balancer, bl);
      __u32 n = mds_info.size();
      ::encode(n, bl);
      for (map<mds_gid_t, mds_info_t>::const_iterator i = mds_info.begin();
@@ -515,6 +516,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
      ::encode(session_autoclose, bl);
      ::encode(max_file_size, bl);
      ::encode(max_mds, bl);
+    ::encode(balancer, bl);
      __u32 n = mds_info.size();
      ::encode(n, bl);
      for (map<mds_gid_t, mds_info_t>::const_iterator i = mds_info.begin();
@@ -542,7 +544,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
      return;
    }
  
-  ENCODE_START(5, 4, bl);
+  ENCODE_START(6, 4, bl);
    ::encode(epoch, bl);
    ::encode(flags, bl);
    ::encode(last_failure, bl);
@@ -551,6 +553,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
    ::encode(session_autoclose, bl);
    ::encode(max_file_size, bl);
    ::encode(max_mds, bl);
+  ::encode(balancer, bl);
    ::encode(mds_info, bl, features);
    ::encode(data_pools, bl);
    ::encode(cas_pool, bl);
@@ -583,7 +586,7 @@ void MDSMap::decode(bufferlist::iterator& p)
    std::map<mds_rank_t,int32_t> inc;  // Legacy field, parse and drop
  
    cached_up_features = 0;
-  DECODE_START_LEGACY_COMPAT_LEN_16(5, 4, 4, p);
+  DECODE_START_LEGACY_COMPAT_LEN_16(6, 4, 4, p);
    ::decode(epoch, p);
    ::decode(flags, p);
    ::decode(last_failure, p);
@@ -592,6 +595,7 @@ void MDSMap::decode(bufferlist::iterator& p)
    ::decode(session_autoclose, p);
    ::decode(max_file_size, p);
    ::decode(max_mds, p);
+  ::decode(balancer, p);
    ::decode(mds_info, p);
    if (struct_v < 3) {
      __u32 n;
diff --git a/src/mds/MDSMap.h b/src/mds/MDSMap.h

index a1aa73b0e1be849d66f7bc25f544d226f90f1e7d..6cdb6b4b5200bae0f2d18834731b80a0f801085a 100644 (file)
--- a/src/mds/MDSMap.h
+++ b/src/mds/MDSMap.h
@@ -193,6 +193,7 @@ protected:
     */
  
    mds_rank_t max_mds; /* The maximum number of active MDSes. Also, the maximum rank. */
+  string balancer; /* The name and version of the metadata load balancer. */
  
    std::set<mds_rank_t> in;              // currently defined cluster
  
@@ -289,6 +290,9 @@ public:
    mds_rank_t get_max_mds() const { return max_mds; }
    void set_max_mds(mds_rank_t m) { max_mds = m; }
  
+  std::string get_balancer() const { return balancer; }
+  void set_balancer(std::string val) { balancer.assign(val); }
+
    mds_rank_t get_tableserver() const { return tableserver; }
    mds_rank_t get_root() const { return root; }
  
diff --git a/src/mds/balancers/greedyspill.lua b/src/mds/balancers/greedyspill.lua

new file mode 100644 (file)

index 0000000..c3e38fa
--- /dev/null
+++ b/src/mds/balancers/greedyspill.lua
@@ -0,0 +1,48 @@
+metrics = {"auth.meta_load", "all.meta_load", "req_rate", "queue_len", "cpu_load_avg"}
+
+-- Metric for balancing is the workload; also dumps metrics
+function mds_load()
+  for i=0, #mds do
+    s = "MDS"..i..": < "
+    for j=1, #metrics do
+      s = s..metrics[j].."="..mds[i][metrics[j]].." "
+    end
+    mds[i]["load"] = mds[i]["all.meta_load"]
+    BAL_LOG(0, s.."> load="..mds[i]["load"])
+  end
+end
+
+-- Shed load when you have load and your neighbor doesn't
+function when()
+  my_load = mds[whoami]["load"]
+  his_load = mds[whoami+1]["load"]
+  if my_load > 0.01 and his_load < 0.01 then
+    BAL_LOG(2, "when: migrating! my_load="..my_load.." hisload="..his_load)
+    return true
+  end
+  BAL_LOG(2, "when: not migrating! my_load="..my_load.." hisload="..his_load)
+  return false
+end
+
+-- Shed half your load to your neighbor
+-- neighbor=whoami+2 because Lua tables are indexed starting at 1
+function where()
+  targets = {}
+  for i=1, #mds+1 do
+    targets[i] = 0
+  end
+
+  targets[whoami+2] = mds[whoami]["load"]/2
+  return targets
+end
+
+mds_load()
+if when() then
+  return where()
+end
+
+targets = {}
+for i=1, #mds+1 do
+  targets[i] = 0
+end
+return targets
diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc

index e5cd558b042f39db7810ba88e8221d5e26d0f24a..122253ab278b48653955ab543fabd1baa2d2f371 100644 (file)
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -1911,6 +1911,15 @@ public:
            fs->mds_map.set_inline_data_enabled(false);
          });
        }
+    } else if (var == "balancer") {
+      ss << "setting the metadata load balancer to " << val;
+        fsmap.modify_filesystem(
+            fs->fscid,
+            [val](std::shared_ptr<Filesystem> fs)
+        {
+          fs->mds_map.set_balancer(val);
+        });
+      return true;
      } else if (var == "max_file_size") {
        if (interr.length()) {
         ss << var << " requires an integer value";
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h

index 1f9c91bf5ea729f198d270394da080325bf0088b..6832e4643a33e441800ae858e7711ea5b961ba34 100644 (file)
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -330,7 +330,7 @@ COMMAND_WITH_FLAG("mds set_max_mds " \
         "set max MDS index", "mds", "rw", "cli,rest", FLAG(DEPRECATED))
  COMMAND_WITH_FLAG("mds set " \
         "name=var,type=CephChoices,strings=max_mds|max_file_size"
-       "|allow_new_snaps|inline_data|allow_multimds|allow_dirfrags " \
+       "|allow_new_snaps|inline_data|allow_multimds|allow_dirfrags|balancer " \
         "name=val,type=CephString "                                     \
         "name=confirm,type=CephString,req=false",                       \
         "set mds parameter <var> to <val>", "mds", "rw", "cli,rest", FLAG(DEPRECATED))
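As a reading aid for the greedyspill.lua policy added in this commit, the same decision can be sketched in standalone C++. This is not code from the commit. The function name `greedy_spill_targets` is hypothetical, and the early return for the last rank is an assumption added for illustration (the Lua script indexes a nil value in that case). The 0.01 thresholds and the halving are taken from the Lua script.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Ceph): mirrors when() and where() from
// greedyspill.lua. loads[i] is the load of MDS rank i; whoami is our rank.
// Returns one target load per rank; all zeros means "do not migrate".
std::vector<double> greedy_spill_targets(const std::vector<double>& loads,
                                         std::size_t whoami)
{
  assert(whoami < loads.size());  // our own rank must be in the table
  std::vector<double> targets(loads.size(), 0.0);

  // Assumption: the last rank has no neighbor, so it never spills.
  if (whoami + 1 >= loads.size())
    return targets;

  const double my_load = loads[whoami];
  const double his_load = loads[whoami + 1];

  // when(): shed load only if we have load and the neighbor has none.
  if (my_load > 0.01 && his_load < 0.01) {
    // where(): send half of our load to the neighbor.
    targets[whoami + 1] = my_load / 2;
  }
  return targets;
}
```

For example, `greedy_spill_targets({1953.35, 0.0, 0.0}, 0)` yields a target of about 976.675 for rank 1 and zero elsewhere, in line with the `new targets={0=0,1=976.675,2=0}` line in the sample log. The sketch indexes targets by 0-based rank, whereas the Lua script uses `whoami+2` because Lua tables start at 1.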