From f1bc95dafe06cc982bdcbadaa46df304de3050fb Mon Sep 17 00:00:00 2001
From: Patrick Donnelly
Date: Tue, 15 Oct 2024 17:12:28 -0400
Subject: [PATCH] doc/dev: add walkthrough for CephFS kernel development

Specifically, an opinionated walkthrough of how to set up an environment for a
built kernel, networking a VM to sepia, and mounting a remote Ceph cluster.

Signed-off-by: Patrick Donnelly
---
 doc/dev/kclient.rst | 478 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 478 insertions(+)
 create mode 100644 doc/dev/kclient.rst

diff --git a/doc/dev/kclient.rst b/doc/dev/kclient.rst
new file mode 100644
index 0000000000000..fd4903ac1abd8
--- /dev/null
+++ b/doc/dev/kclient.rst
@@ -0,0 +1,478 @@

Testing changes to the Linux Kernel CephFS driver
=================================================

This walkthrough will explain one (opinionated) way to do testing of the Linux
kernel client against a development cluster. We will try to minimize any
assumptions about pre-existing knowledge of how to do kernel builds or any
related best practices.

.. note:: There are many completely valid ways to do kernel development for
    Ceph. This guide is a walkthrough of the author's own environment.
    You may decide to do things very differently.

Step One: build the kernel
==========================

Clone the kernel:

.. code-block:: bash

    git init linux && cd linux
    git remote add torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    git remote add ceph https://github.com/ceph/ceph-client.git
    git fetch --all && git checkout torvalds/master


Configure the kernel:

.. code-block:: bash

    make defconfig

.. note:: You can alternatively use the `Ceph Kernel QA Config`_ for building the kernel.

We now have a kernel config with reasonable defaults for the architecture you're
building on. The next thing to do is to enable configs which will build Ceph
and/or provide functionality we need to do testing. A small fragment along the
lines of the following should cover what this walkthrough relies on (the
`Ceph Kernel QA Config`_ above is a more complete reference); merge it into the
config and then build the kernel image:

.. code-block:: bash

    cat > ~/.ceph.config <<EOF
    # (abbreviated, illustrative list) the CephFS driver and its dependencies
    CONFIG_CEPH_LIB=y
    CONFIG_CEPH_FS=y
    CONFIG_CEPH_FS_POSIX_ACL=y
    # virtio drivers for the qemu/KVM setup used below
    CONFIG_VIRTIO_PCI=y
    CONFIG_VIRTIO_BLK=y
    CONFIG_VIRTIO_NET=y
    # debugging facilities used later in this walkthrough
    CONFIG_DYNAMIC_DEBUG=y
    EOF
    # merge the fragment into .config, filling in defaults for anything new
    ./scripts/kconfig/merge_config.sh .config ~/.ceph.config
    make -j$(nproc) bzImage

Step Two: run the kernel in a VM
================================

We will run the newly built kernel in a virtual machine rather than on bare
metal because:

* Fast boot (going from power on -> mount in less than 10 seconds).
* A fault in the kernel won't crash your machine.
* You have a suite of tools available for analysis on the running kernel.

The main decision for you to make is what Linux distribution you want to use.
This document uses Arch Linux due to the author's familiarity. We also use LVM
to create a volume. You may use partitions or whatever mechanism you like to
create a block device. In general, this block device will be used repeatedly in
testing. You may want to use snapshots to avoid a VM somehow corrupting your
root disk and forcing you to start over.


.. code-block:: bash

    # create a volume
    VOLUME_GROUP=foo
    sudo lvcreate -L 256G "$VOLUME_GROUP" -n $(whoami)-vm-0
    DEV="/dev/${VOLUME_GROUP}/$(whoami)-vm-0"
    sudo mkfs.xfs "$DEV"
    sudo mount "$DEV" /mnt
    sudo pacstrap /mnt base base-devel vim less jq
    sudo arch-chroot /mnt
    # # delete root's password for ease of login
    # passwd -d root
    # mkdir -p /root/.ssh && echo "$YOUR_SSH_KEY_PUBKEY" >> /root/.ssh/authorized_keys
    # exit
    sudo umount /mnt

Once that's done, we should be able to run a VM:


.. code-block:: bash

    qemu-system-x86_64 -enable-kvm -kernel $(pwd)/arch/x86/boot/bzImage -drive file="$DEV",if=virtio,format=raw -append 'root=/dev/vda rw'

You should see output like:

::

    VNC server running on ::1:5900

You can view that console using:


.. code-block:: bash

    vncviewer 127.0.0.1:5900

Congratulations, you have a VM running the kernel that you just built.
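
If you would rather interact with the kernel console directly in your terminal
instead of through VNC, qemu can put the serial console on stdio. A variant of
the invocation above (relying on the 8250 serial console that ``defconfig``
already enables) looks like:

.. code-block:: bash

    # -nographic sends the serial port and the qemu monitor to this terminal;
    # console=ttyS0 makes the kernel log there (and systemd will typically
    # spawn a login prompt on it). Exit qemu with: Ctrl-a x
    qemu-system-x86_64 -enable-kvm -nographic \
        -kernel $(pwd)/arch/x86/boot/bzImage \
        -drive file="$DEV",if=virtio,format=raw \
        -append 'console=ttyS0 root=/dev/vda rw'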

Step Three: Networking the VM
=============================

This is the "hard part" and requires the most customization depending on what
you want to do. This author currently has a development setup like:


::

           sepian netns
          ______________
         |              |
         |  kernel VM   |                    sepia-bounce VM          vossi04.front.sepia.ceph.com
         |   -------    |                     _____________                   _____________
         |  |       |   |                    | 192.168.20.1|                 |             |
         |  |       |---|- <- wireguard -> --|             |                 |             |
         |  |_______|   |                    | 192.168.20.2| <- sepia vpn -> |             |
         |     br0      |                    |_____________|                 |_____________|
         |______________|


The sepia-bounce VM is used as a bounce box to the sepia lab. It can proxy ssh
connections, route any sepia-bound traffic, or serve as a DNS proxy. The use of
a sepia-bounce VM is optional but can be useful, especially if you want to
create numerous kernel VMs for testing.

I like to use the vossi04 `developer playground`_ to build Ceph and set up a
vstart cluster. It has sufficient resources to make building Ceph very fast
(~5 minutes cold build) and local disk resources to run a decent vstart
cluster.

To avoid overcomplicating this document with the details of the sepia-bounce
VM, I will only note the main configurations used for the purpose of testing
the kernel:

- set up a wireguard tunnel between the machine hosting the kernel VMs and the
  sepia-bounce VM
- use ``systemd-resolved`` as a DNS resolver and have it listen on 192.168.20.2
  (instead of just localhost)
- connect to the sepia `VPN`_ and use the `systemd resolved update script`_ to
  configure ``systemd-resolved`` to use the DNS servers acquired via DHCP from
  the sepia VPN
- configure ``firewalld`` to allow wireguard traffic and to masquerade and
  forward traffic to the sepia VPN

The next task is to connect the kernel VM to the sepia-bounce VM. A network
namespace is useful for this purpose: it isolates the traffic and routing rules
for the VMs. I orchestrate this using a custom systemd one-shot unit that looks
like:

::

    # create the net namespace
    ExecStart=/usr/bin/ip netns add sepian
    # bring lo up
    ExecStart=/usr/bin/ip netns exec sepian ip link set dev lo up
    # set up wireguard to sepia-bounce
    ExecStart=/usr/bin/ip link add wg-sepian type wireguard
    ExecStart=/usr/bin/wg setconf wg-sepian /etc/wireguard/wg-sepian.conf
    # move the wireguard interface to the sepian netns
    ExecStart=/usr/bin/ip link set wg-sepian netns sepian
    # configure the static ip and bring it up
    ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.20.1/24 dev wg-sepian
    ExecStart=/usr/bin/ip netns exec sepian ip link set wg-sepian up
    # logging info
    ExecStart=/usr/bin/ip netns exec sepian ip addr
    ExecStart=/usr/bin/ip netns exec sepian ip route
    # make wireguard the default route
    ExecStart=/usr/bin/ip netns exec sepian ip route add default via 192.168.20.2 dev wg-sepian
    # more logging
    ExecStart=/usr/bin/ip netns exec sepian ip route
    # add a bridge interface for VMs
    ExecStart=/usr/bin/ip netns exec sepian ip link add name br0 type bridge
    # configure the addresses and bring it up
    ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.0.1/24 dev br0
    ExecStart=/usr/bin/ip netns exec sepian ip link set br0 up
    # masquerade/forward traffic to sepia-bounce
    ExecStart=/usr/bin/ip netns exec sepian iptables -t nat -A POSTROUTING -o wg-sepian -j MASQUERADE
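
The unit above loads the wireguard peer configuration from
``/etc/wireguard/wg-sepian.conf``. Note that ``wg setconf`` takes the plain
``wg`` format, not the ``wg-quick`` format, so no addresses or routes belong in
it. The keys and endpoint below are placeholders for whatever your sepia-bounce
VM is actually configured with; a sketch of that file looks something like:

::

    # /etc/wireguard/wg-sepian.conf -- placeholder keys/endpoint
    [Interface]
    PrivateKey = <this host's wireguard private key>

    [Peer]
    PublicKey = <sepia-bounce's wireguard public key>
    Endpoint = <sepia-bounce's reachable address>:51820
    # carry all traffic, matching the default route added above
    AllowedIPs = 0.0.0.0/0
    PersistentKeepalive = 25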

When using the network namespace, we will run commands via ``ip netns exec``.
That command has a handy feature: files under ``/etc/netns/sepian/`` are
automatically bind mounted over the corresponding files in ``/etc`` for the
command being run:

::

    # cat /etc/netns/sepian/resolv.conf
    nameserver 192.168.20.2

That file will configure the libc name resolution stack to route DNS requests
for applications to the ``systemd-resolved`` daemon running on sepia-bounce.
Consequently, any application running in that netns will be able to resolve
sepia hostnames:

::

    $ sudo ip netns exec sepian host vossi04.front.sepia.ceph.com
    vossi04.front.sepia.ceph.com has address 172.21.10.4


Okay, great. We have a network namespace that forwards traffic to the sepia
VPN. The next step is to connect virtual machines running our kernel to the
bridge we have configured. The straightforward way to do that is to create a
"tap" device which connects to the bridge:

.. code-block:: bash

    sudo ip netns exec sepian qemu-system-x86_64 \
        -enable-kvm \
        -kernel $(pwd)/arch/x86/boot/bzImage \
        -drive file="$DEV",if=virtio,format=raw \
        -netdev tap,id=net0,ifname=tap0,script="$HOME/bin/qemu-br0",downscript=no \
        -device virtio-net-pci,netdev=net0 \
        -append 'root=/dev/vda rw'

The relevant new bits here are (a) executing the VM in the netns we have
constructed; (b) a ``-netdev`` option to create a tap device; and (c) a virtual
network card for the VM. There is also a script ``$HOME/bin/qemu-br0`` run by
qemu to configure the tap device it creates for the VM:

::

    #!/bin/bash
    tap=$1
    ip link set "$tap" master br0
    ip link set dev "$tap" up

That simply plugs the new tap device into the bridge.

This is all well and good, but we are now missing one last crucial piece: what
is the IP address of the VM? There are two options:

1. Configure a static IP. The caveat is that the network configuration on the
   VM's root device must be modified accordingly.
2. Use DHCP, and configure the VMs' root device to always use DHCP to obtain
   addresses for its ethernet device.

The second option is more complicated to set up, since you must now run a DHCP
server, but it provides the greatest flexibility for adding more VMs as needed
when testing.
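
For the second option, the VM's root image also needs a DHCP client enabled on
its ethernet device. The ``pacstrap`` in Step Two did not configure this; one
way to do it (assuming the Arch guest uses ``systemd-networkd`` for its
networking) is a small ``.network`` unit:

::

    # /etc/systemd/network/20-wired.network (inside the VM's root image)
    [Match]
    Name=en*

    [Network]
    DHCP=yes

Enable it from the chroot (or the running VM) with
``systemctl enable systemd-networkd systemd-resolved``.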

The modified (or "hacked") standard ``dhcpd`` systemd service looks like:

::

    # cat sepian-dhcpd.service
    [Unit]
    Description=IPv4 DHCP server
    After=network.target network-online.target sepian-netns.service
    Wants=network-online.target
    Requires=sepian-netns.service

    [Service]
    ExecStartPre=/usr/bin/touch /tmp/dhcpd.leases
    ExecStartPre=/usr/bin/cat /etc/netns/sepian/dhcpd.conf
    ExecStart=/usr/bin/dhcpd -f -4 -q -cf /etc/netns/sepian/dhcpd.conf -lf /tmp/dhcpd.leases
    NetworkNamespacePath=/var/run/netns/sepian
    RuntimeDirectory=dhcpd4
    User=dhcp
    AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_RAW
    ProtectSystem=full
    ProtectHome=on
    KillSignal=SIGINT
    # We pull in network-online.target for a configured network connection.
    # However this is not guaranteed to be the network connection our
    # networks are configured for. So try to restart on failure with a delay
    # of two seconds. Rate limiting kicks in after 12 seconds.
    RestartSec=2s
    Restart=on-failure
    StartLimitInterval=12s

    [Install]
    WantedBy=multi-user.target

Similarly, the referenced ``dhcpd.conf``:

::

    # cat /etc/netns/sepian/dhcpd.conf
    option domain-name-servers 192.168.20.2;
    option subnet-mask 255.255.255.0;
    option routers 192.168.0.1;
    subnet 192.168.0.0 netmask 255.255.255.0 {
        range 192.168.0.100 192.168.0.199;
    }

Importantly, this tells the VM to route traffic via 192.168.0.1 (the IP of the
bridge in the netns) and that DNS is provided by 192.168.20.2 (the
``systemd-resolved`` instance on the sepia-bounce VM).

In the VM, the networking looks like:

::

    [root@archlinux ~]# ip link
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
        link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/sit 0.0.0.0 brd 0.0.0.0
    [root@archlinux ~]# ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host noprefixroute
           valid_lft forever preferred_lft forever
    2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
        inet 192.168.0.100/24 metric 1024 brd 192.168.0.255 scope global dynamic enp0s3
           valid_lft 28435sec preferred_lft 28435sec
        inet6 fe80::5054:ff:fe12:3456/64 scope link proto kernel_ll
           valid_lft forever preferred_lft forever
    3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
        link/sit 0.0.0.0 brd 0.0.0.0
    [root@archlinux ~]# systemd-resolve --status
    Global
               Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
        resolv.conf mode: stub
    Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net
                          8.8.8.8#dns.google 2606:4700:4700::1111#cloudflare-dns.com
                          2620:fe::9#dns.quad9.net 2001:4860:4860::8888#dns.google

    Link 2 (enp0s3)
        Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
             Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    Current DNS Server: 192.168.20.2
           DNS Servers: 192.168.20.2

    Link 3 (sit0)
        Current Scopes: none
             Protocols: -DefaultRoute +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
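
At this point the VM has an address from dhcpd, a default route via the bridge,
and DNS pointed at sepia-bounce, so it should be able to reach the sepia lab
directly. A quick smoke test from inside the VM:

.. code-block:: bash

    # name resolution goes through systemd-resolved on sepia-bounce (192.168.20.2)
    getent hosts vossi04.front.sepia.ceph.com
    # traffic routes via br0 -> wireguard -> sepia-bounce -> sepia VPN
    ping -c 3 vossi04.front.sepia.ceph.com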

Finally, some other networking configurations to consider:

* Run the VM on your machine with full access to the host networking stack. If
  the host is already on the sepia VPN, this will probably work without too
  much configuration.
* Run the VM in a netns as above, but also set up the sepia VPN in the same
  netns. This can help to avoid using a sepia-bounce VM. You'll still need to
  configure routing between the bridge and the sepia VPN.
* Run the VM in a netns as above, but only use a local vstart cluster (possibly
  in another VM) in the same netns.


Step Four: mounting a CephFS file system in your VM
===================================================

This guide uses a vstart cluster on a machine in the sepia lab. Because the mon
addresses change with every new vstart cluster, any static configuration we set
up for the VM to mount CephFS via the kernel driver would quickly become
invalid. So, we should create a script that fetches the configuration from our
vstart cluster prior to mounting:

.. code-block:: bash

    #!/bin/bash
    # kmount.sh -- mount CephFS from a vstart cluster running on a remote machine

    # the cephx client credential, vstart creates "client.fs" by default
    NAME=fs
    # static fs name, vstart creates an "a" file system by default
    FS=a
    # where to mount on the VM
    MOUNTPOINT=/mnt
    # cephfs mount point (root by default)
    CEPHFS_MOUNTPOINT=/

    function run {
        printf '%s\n' "$*" >&2
        "$@"
    }

    function mssh {
        run ssh vossi04.front.sepia.ceph.com "cd ceph/build && (source vstart_environment.sh; $1)"
    }

    # Create the minimal config (including mon addresses) and store it in the
    # VM's ceph.conf. This is not used for mounting; we're storing it for
    # potential use with `ceph` commands.
    mssh "ceph config generate-minimal-conf" > /etc/ceph/ceph.conf
    # get the vstart cluster's fsid
    FSID=$(mssh "ceph fsid")
    # get the auth key associated with client.fs
    KEY=$(mssh "ceph auth get-key client.$NAME")
    # dump the v2 mon addresses and format them for the -o mon_addr mount option
    MONS=$(mssh "ceph mon dump --format=json" | jq -r '.mons[] | .public_addrs.addrvec[] | select(.type == "v2") | .addr' | paste -s -d/)

    # turn on kernel debugging (and any other debugging you'd like)
    echo "module ceph +p" | tee /sys/kernel/debug/dynamic_debug/control
    # do the mount! we use the new device syntax for this mount
    run mount -t ceph "${NAME}@${FSID}.${FS}=${CEPHFS_MOUNTPOINT}" -o "mon_addr=${MONS},ms_mode=crc,name=${NAME},secret=${KEY},norequire_active_mds,noshare" "$MOUNTPOINT"

That would be run like:

.. code-block:: bash

    $ sudo ip netns exec sepian ssh root@192.168.0.100 ./kmount.sh
    ...
    mount -t ceph fs@c9653bca-110b-4f70-9f84-5a195b205e9a.a=/ -o mon_addr=172.21.10.4:40762/172.21.10.4:40764/172.21.10.4:40766,ms_mode=crc,name=fs,secret=AQD0jgln43pBCxAA7cJlZ4Px7J0UmiK4A4j3rA==,norequire_active_mds,noshare /mnt
    $ sudo ip netns exec sepian ssh root@192.168.0.100 df -h /mnt
    Filesystem                                   Size  Used Avail Use% Mounted on
    fs@c9653bca-110b-4f70-9f84-5a195b205e9a.a=/  169G     0  169G   0% /mnt
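
The script turns on dynamic debugging for the ``ceph`` module, which is fairly
verbose. Those messages go to the kernel ring buffer on the VM; you can follow
them while exercising the mount and switch them back off when you are done:

.. code-block:: bash

    # inside the VM: follow the CephFS kernel debug messages
    dmesg --follow | grep -i ceph
    # switch the extra debugging back off once finished
    echo "module ceph -p" | tee /sys/kernel/debug/dynamic_debug/control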

If you run into difficulties, it may be:

* The firewall on the node running the vstart cluster is blocking your
  connections.
* Some misconfiguration in your networking stack.
* An incorrect configuration for the mount.


Step Five: testing kernel changes in teuthology
===============================================

There are 3 static branches in the `ceph kernel git repository`_ managed by the
Ceph team:

* `for-linus <https://github.com/ceph/ceph-client/tree/for-linus>`_: A branch
  managed by the primary Ceph maintainer to share changes with Linus Torvalds
  (upstream). Do not push to this branch.
* `master <https://github.com/ceph/ceph-client/tree/master>`_: A staging ground
  for patches planned to be sent to Linus. Do not push to this branch.
* `testing <https://github.com/ceph/ceph-client/tree/testing>`_: A staging
  ground for miscellaneous patches that need wider QA testing (via nightlies or
  regular Ceph QA testing). Push patches you believe to be nearly ready for
  upstream acceptance.

You may also push a ``wip-$feature`` branch to the ``ceph-client.git``
repository, which will be built by Jenkins. You can then view the results of
the build in `Shaman <https://shaman.ceph.com/>`_.

Once a kernel branch is built, you can test it via the ``fs`` CephFS QA suite:

.. code-block:: bash

    $ teuthology-suite ... --suite fs --kernel wip-$feature --filter k-testing


The ``k-testing`` filter selects the fragment which normally sets the
``testing`` branch of the kernel for routine QA. That is, the ``fs`` suite
regularly runs tests against whatever is in the ``testing`` branch of the
kernel. We are overriding that choice of kernel branch via the ``--kernel
wip-$feature`` switch.

.. note:: Without filtering for ``k-testing``, the ``fs`` suite will also run
    jobs using ceph-fuse or the stock kernel, libcephfs tests, and other tests
    that may not be of interest when evaluating changes to the kernel.

The actual override is controlled using Lua merge scripts in the
``k-testing.yaml`` fragment. See that file for more details.


.. _VPN: https://wiki.sepia.ceph.com/doku.php?id=vpnaccess
.. _systemd resolved update script: https://wiki.archlinux.org/title/Systemd-resolved
.. _Ceph Kernel QA Config: https://github.com/ceph/ceph-build/tree/899d0848a0f487f7e4cee773556aaf9529b8db26/kernel/build
.. _developer playground: https://wiki.sepia.ceph.com/doku.php?id=devplayground#developer_playgrounds
.. _ceph kernel git repository: https://github.com/ceph/ceph-client
-- 
2.39.5