From: Michael Sevilla Date: Tue, 2 Feb 2016 19:25:42 +0000 (-0800) Subject: mantle: store balancer in RADOS, balancer version in MDSMap X-Git-Tag: v11.1.0~502^2~1 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=0829028d1c5a9fed9d7b258e57be550ea3b1d583;p=ceph.git mantle: store balancer in RADOS, balancer version in MDSMap - add docs and sample balancer (greedy-spill) Signed-off-by: Michael Sevilla --- diff --git a/doc/cephfs/experimental-features.rst b/doc/cephfs/experimental-features.rst index 4b9049592d2a..1f6e3c2af41c 100644 --- a/doc/cephfs/experimental-features.rst +++ b/doc/cephfs/experimental-features.rst @@ -52,6 +52,14 @@ There are serious known bugs. Multi-MDS filesystems have always required explicitly increasing the "max_mds" value and have been further protected with the "allow_multimds" flag for Jewel. +Mantle: Programmable Metadata Load Balancer +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Mantle is a programmable metadata balancer built into the MDS. The idea is to +protect the mechanisms for balancing load (migration, replication, +fragmentation) but stub out the balancing policies using Lua. For details, see +:doc:`/cephfs/mantle`. + Snapshots --------- Like multiple active MDSes, CephFS is designed from the ground up to support diff --git a/doc/cephfs/mantle.rst b/doc/cephfs/mantle.rst new file mode 100644 index 000000000000..9d0b6bd97f5f --- /dev/null +++ b/doc/cephfs/mantle.rst @@ -0,0 +1,235 @@ +Mantle +====== + +Multiple, active MDSs can migrate directories to balance metadata load. The +policies for when, where, and how much to migrate are hard-coded into the +metadata balancing module. Mantle is a programmable metadata balancer built +into the MDS. The idea is to protect the mechanisms for balancing load +(migration, replication, fragmentation) but stub out the balancing policies +using Lua. 
Mantle is based on [1] but the current implementation does *NOT* +have the following features from that paper: + +1. Balancing API: in the paper, the user fills in when, where, how much, and + load calculation policies; currently, Mantle only requires that Lua policies + return a table of target loads (e.g., how much load to send to each MDS) +2. "How much" hook: in the paper, there was a hook that let the user control + the fragment selector policy; currently, Mantle does not have this hook +3. Instantaneous CPU utilization as a metric + +[1] Supercomputing '15 Paper: +http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html + +Quickstart with vstart +---------------------- + +.. warning:: + + Developing balancers with vstart is difficult because running all daemons + and clients on one node can overload the system. Let it run for a while, even + though you will likely see a bunch of lost heartbeat and laggy MDS warnings. + Most of the time this guide will work but sometimes all MDSs lock up and you + cannot actually see them spill. It is much better to run this on a cluster. + +As a prerequisite, we assume you've installed `mdtest +`_ or pulled the `Docker image +`_. We use mdtest because we +need to generate enough load to get over the MIN_OFFLOAD threshold that is +arbitrarily set in the balancer. For example, this does not create enough +metadata load: + +:: + + while true; do + touch "/cephfs/blah-`date`" + done + + +Mantle with `vstart.sh` +~~~~~~~~~~~~~~~~~~~~~~~ + +1. Start Ceph and tune the logging so we can see migrations happen: + +:: + + ./vstart.sh -n -l + for i in a b c; do + ./ceph --admin-daemon out/mds.$i.asok config set debug_ms 0 + ./ceph --admin-daemon out/mds.$i.asok config set debug_mds 0 + ./ceph --admin-daemon out/mds.$i.asok config set debug_mds_balancer 2 + ./ceph --admin-daemon out/mds.$i.asok config set mds_beacon_grace 1500 + done + + +2. 
Put the balancer into RADOS: + +:: + + ./rados put --pool=cephfs_metadata_a greedyspill.lua mds/balancers/greedyspill.lua + + +3. Activate Mantle: + +:: + + ./ceph mds set allow_multimds true --yes-i-really-mean-it + ./ceph mds set max_mds 5 + ./ceph mds set balancer greedyspill.lua + + +4. Mount CephFS in another window: + +:: + + ./ceph-fuse /cephfs -o allow_other & + tail -f out/mds.a.log + + + Note that if you look at the last MDS (which could be a, b, or c -- it's + random), you will see an attempt to index a nil value. This is because the + last MDS tries to check the load of its neighbor, which does not exist. + +5. Run a simple benchmark. In our case, we use the Docker mdtest image to + create load. Assuming that CephFS is mounted in the first container, we can + share the mount and run 3 clients using: + +:: + + for i in 0 1 2; do + docker run -d \ + --name=client$i \ + -v /cephfs:/cephfs \ + michaelsevilla/mdtest \ + -F -C -n 100000 -d "/cephfs/client-test$i" + done + + +6. When you're done, you can kill all the clients with: + +:: + + for i in 0 1 2; do docker rm -f client$i; done + + +Output +~~~~~~ + +Looking at the log for the first MDS (could be a, b, or c), we see that +no MDS has any load: + +:: + + 2016-08-21 06:44:01.763930 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0 + 2016-08-21 06:44:01.763966 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0 + 2016-08-21 06:44:01.763982 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0 + 2016-08-21 06:44:01.764010 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=0.0 hisload=0.0 + 2016-08-21 06:44:01.764033 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={} + + +After the job starts, MDS0 gets about 1953 units of load. 
The greedy spill +balancer dictates that half the load goes to your neighbor MDS, so we see that +Mantle tries to send half of MDS0's load, about 977 load units, to MDS1. + +:: + + 2016-08-21 06:45:21.869994 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=5834.188908912 all.meta_load=1953.3492228857 req_rate=12591.0 queue_len=1075.0 cpu_load_avg=3.05 > load=1953.3492228857 + 2016-08-21 06:45:21.870017 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0 + 2016-08-21 06:45:21.870027 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0 + 2016-08-21 06:45:21.870034 7fd03aaf7700 2 lua.balancer when: migrating! my_load=1953.3492228857 hisload=0.0 + 2016-08-21 06:45:21.870050 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={0=0,1=976.675,2=0} + 2016-08-21 06:45:21.870094 7fd03aaf7700 0 mds.0.bal - exporting [0,0.52287 1.04574] 1030.88 to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690] + 2016-08-21 06:45:21.870151 7fd03aaf7700 0 mds.0.migrator nicely exporting to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690] + + +Eventually load moves around: + +:: + + 2016-08-21 06:47:10.210253 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=415.77414300449 all.meta_load=415.79000078186 req_rate=82813.0 queue_len=0.0 cpu_load_avg=11.97 > load=415.79000078186 + 2016-08-21 06:47:10.210277 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=228.72023977691 all.meta_load=186.5606496623 req_rate=28580.0 queue_len=0.0 cpu_load_avg=11.97 > 
load=186.5606496623 + 2016-08-21 06:47:10.210290 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=11.97 > load=0.0 + 2016-08-21 06:47:10.210298 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=415.79000078186 hisload=186.5606496623 + 2016-08-21 06:47:10.210311 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={} + + +Implementation Details +---------------------- + +Most of the implementation is in MDBalancer. Metrics are passed to the balancer +policies via the Lua stack and a list of loads is returned back to MDBalancer. +It sits alongside the current balancer implementation and it's enabled with a +Ceph CLI command ("ceph mds set balancer mybalancer.lua"). If the Lua policy +fails (for whatever reason), we fall back to the original metadata load +balancer. The balancer is stored in the RADOS metadata pool and a string in the +MDSMap tells the MDSs which balancer to use. + +Exposing Metrics to Lua +~~~~~~~~~~~~~~~~~~~~~~ + +Metrics are exposed directly to the Lua code as global variables instead of +using a well-defined function signature. There is a global "mds" table, where +each index is an MDS number (e.g., 0) and each value is a dictionary of metrics +and values. The Lua code can grab metrics using something like this: + +:: + + mds[0]["queue_len"] + + +This is in contrast to cls-lua in the OSDs, which has well-defined arguments +(e.g., input/output bufferlists). Exposing the metrics directly makes it easier +to add new metrics without having to change the API on the Lua side; we want +the API to grow and shrink as we explore which metrics matter. The downside of +this approach is that the person programming Lua balancer policies has to look +at the Ceph source code to see which metrics are exposed. We figure that the +Mantle developer will be in touch with MDS internals anyways. 
+ +The metrics exposed to the Lua policy are the same ones that are already stored +in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_len, +cpu_load_avg. + +Compile/Execute the Balancer +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here we use `lua_pcall` instead of `lua_call` because we want to handle errors +in the MDBalancer. We do not want the error propagating up the call chain. The +cls_lua class wants to handle the error itself because it must fail gracefully. +For Mantle, we don't care if a Lua error crashes our balancer -- in that case, +we'll fall back to the original balancer. + +The performance improvement of using `lua_call` over `lua_pcall` would not be +leveraged here because the balancer is invoked every 10 seconds by default. + +Returning Policy Decision to C++ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We force the Lua policy engine to return a table of values, corresponding to +the amount of load to send to each MDS. These loads are inserted directly into +the MDBalancer "my_targets" vector. We do not allow the MDS to return a table +of MDSs and metrics because we want the decision to be completely made on the +Lua side. + +Iterating through tables returned by Lua is done through the stack. In Lua +jargon: a dummy value is pushed onto the stack and the next iterator replaces +the top of the stack with a (k, v) pair. After reading each value, pop that +value but keep the key for the next call to `lua_next`. + +Debugging +~~~~~~~~~ + +Logging in a Lua policy will appear in the MDS log. The syntax is the same as +the cls logging interface: + +:: + + BAL_LOG(0, "this is a log message") + + +It is implemented by passing a function that wraps the `dout` logging framework +(`dout_wrapper`) to Lua with the `lua_register()` primitive. The Lua code is +actually calling the `dout` function in C++. + +Testing +~~~~~~~ + +Testing is done with the ceph-qa-suite (tasks.cephfs.test_mantle). We do not +test invalid balancer logging and loading the actual Lua VM. 
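The fallback behaviour described under "Compile/Execute the Balancer" and "Returning Policy Decision to C++" can be sketched in plain C++. This is only an illustration under stated assumptions, not the MDBalancer code: it does not use the Lua C API, a C++ exception stands in for the Lua error that `lua_pcall` would catch, and `Metrics`, `Policy`, and `run_policy` are hypothetical names that do not exist in Ceph.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names exist in Ceph.
using Metrics = std::map<std::string, double>;  // metrics of one MDS
using Policy = std::function<std::vector<double>(const std::vector<Metrics>&,
                                                 std::size_t)>;

// Run a policy in "protected" mode. Returns true and fills `targets` only if
// the policy produced exactly one target load per MDS. An exception (the
// analogue of a Lua error caught by lua_pcall) or a result of the wrong size
// returns false, which tells the caller to use the original balancer instead.
bool run_policy(const Policy& policy, const std::vector<Metrics>& mds,
                std::size_t whoami, std::vector<double>& targets) {
  assert(whoami < mds.size());
  try {
    std::vector<double> result = policy(mds, whoami);
    if (result.size() != mds.size()) {
      return false;  // malformed decision: do not trust it
    }
    targets = result;
    return true;
  } catch (const std::exception&) {
    return false;  // policy error: the error does not propagate further
  }
}
```

A caller would invoke `run_policy` once per balancing tick and, when it returns false, continue with the built-in balancing path; `targets` is left untouched in that case.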
diff --git a/src/mds/MDBalancer.cc b/src/mds/MDBalancer.cc index 33bbbe91f940..df66226f79d6 100644 --- a/src/mds/MDBalancer.cc +++ b/src/mds/MDBalancer.cc @@ -37,6 +37,7 @@ using std::map; using std::vector; #include "common/config.h" +#include "common/errno.h" #define dout_subsys ceph_subsys_mds #undef DOUT_COND @@ -170,6 +171,28 @@ mds_load_t MDBalancer::get_load(utime_t now) return load; } +int MDBalancer::localize_balancer(string const balancer) +{ + int64_t pool_id = mds->mdsmap->get_metadata_pool(); + string fname = "/tmp/" + balancer; + + dout(15) << "looking for balancer=" << balancer << " in RADOS pool_id=" << pool_id << dendl; + object_t oid = object_t(balancer); + object_locator_t oloc(pool_id); + bufferlist data; + C_SaferCond waiter; + mds->objecter->read(oid, oloc, 0, 0, CEPH_NOSNAP, &data, 0, &waiter); + int r = waiter.wait(); + if (r == 0) { + dout(15) << "write data from RADOS into fname=" << fname << " data=" << data.c_str() << dendl; + data.write_file(fname.c_str()); + } else { + dout(0) << "tick could not find balancer " << balancer + << " in RADOS: " << cpp_strerror(r) << dendl; + } + return r; +} + void MDBalancer::send_heartbeat() { utime_t now = ceph_clock_now(g_ceph_context); @@ -606,8 +629,13 @@ void MDBalancer::prep_rebalance(int beat) int MDBalancer::mantle_prep_rebalance() { - /* hard-code lua balancer */ - string script = "BAL_LOG(0, \"I am mds \"..whoami)\n return {11, 12, 3}"; + /* pull metadata balancer from RADOS */ + string const balancer = mds->mdsmap->get_balancer(); + if (balancer == "" || localize_balancer(balancer)) + return -ENOENT; + ifstream f("/tmp/" + balancer); + string script((istreambuf_iterator<char>(f)), + istreambuf_iterator<char>()); /* prepare for balancing */ int cluster_size = mds->get_mds_map()->get_num_in_mds(); diff --git a/src/mds/MDSMap.cc b/src/mds/MDSMap.cc index eec9d4687629..a0d7336e2b69 100644 --- a/src/mds/MDSMap.cc +++ b/src/mds/MDSMap.cc @@ -487,6 +487,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t 
features) const ::encode(session_autoclose, bl); ::encode(max_file_size, bl); ::encode(max_mds, bl); + ::encode(balancer, bl); __u32 n = mds_info.size(); ::encode(n, bl); for (map::const_iterator i = mds_info.begin(); @@ -515,6 +516,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const ::encode(session_autoclose, bl); ::encode(max_file_size, bl); ::encode(max_mds, bl); + ::encode(balancer, bl); __u32 n = mds_info.size(); ::encode(n, bl); for (map::const_iterator i = mds_info.begin(); @@ -542,7 +544,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const return; } - ENCODE_START(5, 4, bl); + ENCODE_START(6, 4, bl); ::encode(epoch, bl); ::encode(flags, bl); ::encode(last_failure, bl); @@ -551,6 +553,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const ::encode(session_autoclose, bl); ::encode(max_file_size, bl); ::encode(max_mds, bl); + ::encode(balancer, bl); ::encode(mds_info, bl, features); ::encode(data_pools, bl); ::encode(cas_pool, bl); @@ -583,7 +586,7 @@ void MDSMap::decode(bufferlist::iterator& p) std::map inc; // Legacy field, parse and drop cached_up_features = 0; - DECODE_START_LEGACY_COMPAT_LEN_16(5, 4, 4, p); + DECODE_START_LEGACY_COMPAT_LEN_16(6, 4, 4, p); ::decode(epoch, p); ::decode(flags, p); ::decode(last_failure, p); @@ -592,6 +595,7 @@ void MDSMap::decode(bufferlist::iterator& p) ::decode(session_autoclose, p); ::decode(max_file_size, p); ::decode(max_mds, p); + ::decode(balancer, p); ::decode(mds_info, p); if (struct_v < 3) { __u32 n; diff --git a/src/mds/MDSMap.h b/src/mds/MDSMap.h index a1aa73b0e1be..6cdb6b4b5200 100644 --- a/src/mds/MDSMap.h +++ b/src/mds/MDSMap.h @@ -193,6 +193,7 @@ protected: */ mds_rank_t max_mds; /* The maximum number of active MDSes. Also, the maximum rank. */ + string balancer; /* The name and version of the metadata load balancer. 
*/ std::set in; // currently defined cluster @@ -289,6 +290,9 @@ public: mds_rank_t get_max_mds() const { return max_mds; } void set_max_mds(mds_rank_t m) { max_mds = m; } + std::string get_balancer() const { return balancer; } + void set_balancer(std::string val) { balancer.assign(val); } + mds_rank_t get_tableserver() const { return tableserver; } mds_rank_t get_root() const { return root; } diff --git a/src/mds/balancers/greedyspill.lua b/src/mds/balancers/greedyspill.lua new file mode 100644 index 000000000000..c3e38fa4a14d --- /dev/null +++ b/src/mds/balancers/greedyspill.lua @@ -0,0 +1,48 @@ +metrics = {"auth.meta_load", "all.meta_load", "req_rate", "queue_len", "cpu_load_avg"} + +-- Metric for balancing is the workload; also dumps metrics +function mds_load() + for i=0, #mds do + s = "MDS"..i..": < " + for j=1, #metrics do + s = s..metrics[j].."="..mds[i][metrics[j]].." " + end + mds[i]["load"] = mds[i]["all.meta_load"] + BAL_LOG(0, s.."> load="..mds[i]["load"]) + end +end + +-- Shed load when you have load and your neighbor doesn't +function when() + my_load = mds[whoami]["load"] + his_load = mds[whoami+1]["load"] + if my_load > 0.01 and his_load < 0.01 then + BAL_LOG(2, "when: migrating! my_load="..my_load.." hisload="..his_load) + return true + end + BAL_LOG(2, "when: not migrating! my_load="..my_load.." 
hisload="..his_load) + return false +end + +-- Shed half your load to your neighbor +-- neighbor=whoami+2 because Lua tables are indexed starting at 1 +function where() + targets = {} + for i=1, #mds+1 do + targets[i] = 0 + end + + targets[whoami+2] = mds[whoami]["load"]/2 + return targets +end + +mds_load() +if when() then + return where() +end + +targets = {} +for i=1, #mds+1 do + targets[i] = 0 +end +return targets diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc index e5cd558b042f..122253ab278b 100644 --- a/src/mon/MDSMonitor.cc +++ b/src/mon/MDSMonitor.cc @@ -1911,6 +1911,15 @@ public: fs->mds_map.set_inline_data_enabled(false); }); } + } else if (var == "balancer") { + ss << "setting the metadata load balancer to " << val; + fsmap.modify_filesystem( + fs->fscid, + [val](std::shared_ptr fs) + { + fs->mds_map.set_balancer(val); + }); + return true; } else if (var == "max_file_size") { if (interr.length()) { ss << var << " requires an integer value"; diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h index 1f9c91bf5ea7..6832e4643a33 100644 --- a/src/mon/MonCommands.h +++ b/src/mon/MonCommands.h @@ -330,7 +330,7 @@ COMMAND_WITH_FLAG("mds set_max_mds " \ "set max MDS index", "mds", "rw", "cli,rest", FLAG(DEPRECATED)) COMMAND_WITH_FLAG("mds set " \ "name=var,type=CephChoices,strings=max_mds|max_file_size" - "|allow_new_snaps|inline_data|allow_multimds|allow_dirfrags " \ + "|allow_new_snaps|inline_data|allow_multimds|allow_dirfrags|balancer " \ "name=val,type=CephString " \ "name=confirm,type=CephString,req=false", \ "set mds parameter to ", "mds", "rw", "cli,rest", FLAG(DEPRECATED))
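As a reading aid for the greedy-spill policy added in `src/mds/balancers/greedyspill.lua`, the same decision can be written as a small, self-contained C++ function. This is a sketch under assumptions, not code from the commit: `greedy_spill_targets` is a hypothetical helper, `loads[i]` plays the role of `mds[i]["load"]`, and the explicit check for the last rank is an addition (the Lua sample has no such check and instead indexes a nil neighbor on the last MDS).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Ceph: mirrors when()/where() from
// greedyspill.lua for a single rank. Returns one target load per rank;
// a vector of zeros means "do not migrate".
std::vector<double> greedy_spill_targets(const std::vector<double>& loads,
                                         std::size_t whoami) {
  assert(whoami < loads.size());
  std::vector<double> targets(loads.size(), 0.0);

  // Assumption added for this sketch: the last rank has no neighbor, so it
  // never spills.
  const std::size_t neighbor = whoami + 1;
  if (neighbor >= loads.size()) {
    return targets;
  }

  // Same thresholds as the Lua sample: shed load only when this rank has
  // load and its neighbor is effectively idle.
  if (loads[whoami] > 0.01 && loads[neighbor] < 0.01) {
    targets[neighbor] = loads[whoami] / 2;  // send half to the neighbor
  }
  return targets;
}
```

With the loads from the second log excerpt in the docs (about 1953.35 on MDS0, zero elsewhere), calling the helper for rank 0 yields a target of about 976.67 for rank 1, which agrees with the `targets={0=0,1=976.675,2=0}` line in that log.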