Add 100 gigabit ethernet guide

2024-03-06 18:35:40 +01:00
parent 7b3f7975ed
commit 8a089e0464
3 changed files with 470 additions and 1 deletions
--- a/res/include/md/100_gigabit_ethernet.md
+++ b/res/include/md/100_gigabit_ethernet.md
@@ -0,0 +1,467 @@
+# Fornax's Guide To Ridiculously Fast Ethernet
+
+- [Introduction](#introduction)
+- [Sysctls](#sysctls)
+  - [net.ipv4.tcp_congestion_control](#net-ipv4-tcp-congestion-control)
+  - [net.core.default_qdisc](#net-core-default-qdisc)
+  - [net.ipv4.tcp_shrink_window](#net-ipv4-tcp-shrink-window)
+  - [net.ipv4.tcp_{w,r}mem](#net-ipv4-tcp-w-r-mem)
+  - [net.ipv4.tcp_mem](#net-ipv4-tcp-mem)
+- [Network Interface Cards](#network-interface-cards)
+- [ethtool](#ethtool)
+  - [Channels (ethtool -l)](#channels-ethtool-l)
+  - [Ring buffers (ethtool -g)](#ring-buffers-ethtool-g)
+  - [Interrupt Coalescing (ethtool -c)](#interrupt-coalescing-ethtool-c)
+- [BIOS](#bios)
+  - [NUMA Nodes per socket](#numa-nodes-per-socket)
+  - [SMT Control](#smt-control)
+  - [IOMMU](#iommu)
+- [Reverse proxy](#reverse-proxy)
+- [Operating system](#operating-system)
+- [Kernel](#kernel)
+
+## Introduction
+
+You might have just downloaded a 10 GB file in 20 seconds and wondered how that
+is even possible. Well, it took a lot of effort to get there.
+
+When I first ordered a 100 GbE server I expected things to just work. Imagine my
+surprise when the server crashed when serving at just 20 Gigabit.
+
+Then I expected my server host to be able to help with the performance problems.
+Spoiler alert: They could not help me.
+
+That's where my journey into the rabbit hole of network performance started. It
+took me just about a year to figure out all the details of high speed
+networking, and now pixeldrain can finally serve files at 100 Gigabit per
+second.
+
+Below is a summary of everything I discovered during my year of reading NIC
+manuals, digging through the kernel sources and patching the kernel.
+
+## Sysctls
+
+When looking into network performance problems the `sysctl`s are usually the
+first thing you get pointed at. There is **a ton** of conflicting information
+online about which sysctls do what and what to set them to.
+
+Sysctls do not persist across reboots. Add these lines to `/etc/sysctl.conf` to
+apply them at startup.
+
+Through experimentation and kernel recompilation I finally settled on these
+values:
+
+### net.ipv4.tcp_congestion_control
+
+You might have heard of BBR, Google's new revolutionary congestion control
+algorithm. You might have heard conflicting information about how good it is. I
+have extensively tested all congestion controls in the kernel and I can say
+without a doubt that BBR is the best, by far! BBR is the only algo which does
+not absolutely tank your transfer rate when a packet is lost.
+
+TCP BBR was merged into the kernel at version 4.9. I know the sysctl says ipv4,
+but it works for IPv6 as well.
+
+`net.ipv4.tcp_congestion_control=bbr`
+
+### net.core.default_qdisc
+
+The qdisc (queuing discipline) is another param which gets mentioned often. The
+qdisc orders packets which are queued so they can be sent in the most efficient
+order possible. The thing is, when you're sending at 100 Gbps then queuing is
+completely irrelevant, the network is rarely the bottleneck here.
+
+Google used to require `fq` with `bbr`, but that requirement has been dropped. I
+suggest you use something minimal and fast. How about `pfifo_fast`, it has fast
+in the name, must be good, right?
+
+`net.core.default_qdisc=pfifo_fast`
+
+This option is only applied after a reboot.
+
+### net.ipv4.tcp_shrink_window
+
+This sysctl was developed by Cloudflare. The patch was merged into Linux 6.1. If
+you are on an older kernel version than 6.1 you will need to manually apply [the
+patches](https://github.com/cloudflare/linux/) and compile the kernel on your
+machine. Without this patch the kernel will waste so much time and memory on
+buffer management that by the time you reach 100 Gigabit the kernel won’t even
+have time to run your app anymore.
+
+Cloudflare has an extensive writeup about the problem this sysctl solves here:
+[Unbounded memory usage by TCP for receive buffers, and how we fixed
+it](https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/)
+
+This sysctl makes sure that TCP buffers are shrunk if they are larger than they
+need to be. Without this sysctl your buffers will grow forever! Before I
+discovered this patch my servers would regularly run out of memory during peak
+load, and these are servers with a TERABYTE OF RAM! After applying the patches
+(and compiling the kernel, because the patches were not merged yet back then)
+memory usage from TCP buffers was reduced by 90%. And performance has improved
+considerably. This patch is so crucial for performance that it boggles my mind
+that it's not enabled by default.
+
+`net.ipv4.tcp_shrink_window=1`
+
+Cloudflare has some other sysctls as well, but those focus more on latency than
+throughput. You can find them here: [Optimizing TCP for high WAN throughput
+while preserving low
+latency](https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/).
+The `net.ipv4.tcp_collapse_max_bytes` sysctl they write about here was never
+merged into the kernel. But while it does improve latency a bit, it's not that
+important for throughput.
+
+### net.ipv4.tcp_{w,r}mem
+
+These variables dictate how much memory can be allocated for your send and
+receive buffers. The send and receive buffers are where TCP packets are stored
+until they are acknowledged by the peer. The required size of these buffers
+depends on your [Bandwidth-Delay Product
+(BDP)](https://en.wikipedia.org/wiki/Bandwidth-delay_product). This concept is
+crucial to understand. If you set the TCP buffers too small it will literally
+put a speed limit on your connection.
+
+First let's go over how TCP sends data. TCP can retransmit packets if the client
+did not receive them. To do this TCP needs to keep all the data it sends to the
+client in memory until the client acknowledges (ACK) that it has been properly
+received. The acknowledgment takes one round trip to the client and back.
+
+Let's say you want to send a file from Amsterdam to Tokyo. The server sends the
+first packet, and 130ms later the client in Tokyo receives it. The client then
+sends an ACK to tell the server that the packet was properly received, and the
+ACK takes another 130ms to arrive back in Amsterdam. Only now can the server
+remove the packet from memory. The whole exchange took 260ms.
+
+Now let's say we want to send files at 10 Gigabit per second, which is 1250 MB
+per second. We multiply the number of bytes we want to send per second by the
+number of seconds it takes to get back the ACK. That's
+`1250 MB/s * 0.260 s = 325 MB`. Now we know that our buffer needs to be at
+least 325 MB to reach a speed of 10 Gigabit over a 260ms round trip.
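+
+Here is a minimal sketch of that calculation in shell, so you can plug in your
+own link speed and round trip time. It assumes decimal units (1 Gigabit = 10^9
+bits) and a shell with 64-bit arithmetic. The function name `bdp_bytes` is made
+up for this example.
+
+```sh
+# bdp_bytes RATE_GBIT RTT_MS
+# Prints the bandwidth-delay product in bytes.
+bdp_bytes() {
+    rate_gbit=$1
+    rtt_ms=$2
+    # Gigabit per second to bytes per second.
+    bytes_per_second=$(( rate_gbit * 1000000000 / 8 ))
+    # Bytes in flight during one round trip.
+    echo $(( bytes_per_second * rtt_ms / 1000 ))
+}
+
+bdp_bytes 10 260   # prints 325000000 (325 MB), the Amsterdam to Tokyo example
+```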
+
+The kernel also stores some other TCP-related stuff in that memory, and we also
+need to account for packet loss which causes packets to be stored for a longer
+time. For this reason pixeldrain servers use a maximum buffer size of 1 GiB.
+
+```
+net.ipv4.tcp_wmem='4096 65536 1073741824'
+net.core.wmem_max=1073741824
+net.ipv4.tcp_rmem='4096 65536 1073741824'
+net.core.rmem_max=1073741824
+```
+
+The three values in the wmem and rmem are the minimum buffer size, the default
+buffer size and the maximum buffer size.
+
+### net.ipv4.tcp_mem
+
+We just configured the buffer sizes, what's this for then? Well... we can tune
+TCP buffers per connection all we want, but all that is for nothing if the
+kernel still limits the TCP buffers globally.
+
+This sysctl configures how much system memory can be used for TCP buffers. On
+boot these values are set based on available system memory, which is good. But
+by default it only uses like 5% of the memory, which is bad. We need to pump
+those numbers way up to get anywhere near the speed that we want.
+
+tcp_mem is defined as three separate values. These values are in numbers of
+memory pages. A memory page is usually 4096B. Here is what these three values mean:
+
+ * `low`: When TCP memory is below this threshold then TCP buffer sizes are not
+   limited.
+ * `pressure`: When the TCP memory usage exceeds this threshold it will try to
+   shrink some TCP buffers to free up memory. It will keep doing this until
+   memory usage drops below `low` again. You don't want to set `low` and
+   `pressure` too far apart.
+ * `high`: The TCP system can't allocate more than this number of pages. If this
+   limit is reached and a new TCP session is opened it will not be able to
+   allocate any memory. Needless to say this is terrible for performance.
+
+After a lot of experimentation with these values I have come to the conclusion
+that the best values for these parameters are 60% of RAM, 70% of RAM and 80% of
+RAM. This will use most of the RAM for TCP buffers if needed, but also leaves
+plenty for your applications.
+
+I set these values dynamically per host with Ansible:
+
+```yaml
+{{noescape `- name: configure tcp_mem
+  sysctl:
+    name: net.ipv4.tcp_mem
+    value: "{{ (mempages|int * 0.6)|int }} {{ (mempages|int * 0.7)|int }} {{ (mempages|int * 0.8)|int }}"
+    state: present
+  vars:
+    mempages: "{{ ansible_memtotal_mb * 256 }}" # There are 256 mempages in a MiB`}}
+```
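+
+If you don't use Ansible, the same numbers can be computed with plain shell
+arithmetic. This is only a sketch: it assumes 4096 byte pages, and the function
+name `tcp_mem_pages` is made up for this example.
+
+```sh
+# tcp_mem_pages MEM_MIB
+# Prints "low pressure high" in memory pages for 60%, 70% and 80% of RAM.
+tcp_mem_pages() {
+    mempages=$(( $1 * 256 ))   # 256 pages of 4096 bytes per MiB
+    echo "$(( mempages * 60 / 100 )) $(( mempages * 70 / 100 )) $(( mempages * 80 / 100 ))"
+}
+
+# Possible usage on a Linux host (MemTotal in /proc/meminfo is in kB):
+#   mem_mib=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
+#   sysctl -w net.ipv4.tcp_mem="$(tcp_mem_pages "$mem_mib")"
+```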
+
+## Network Interface Cards
+
+There are lots of NICs to choose from. From my testing there are a lot of bad
+apples in the bunch. The only NIC types I have had any luck with are ConnectX-5
+and ConnectX-6.
+
+Often you see advice to install a proprietary driver for your NIC. Don't do
+that. In my experience that has only caused problems. Nvidia's NIC drivers are
+just as shitty as their video drivers. They will break kernel updates and
+generally make your life miserable.
+
+Upgrading the firmware for your NIC can be a good idea.
+
+## ethtool
+
+Ethtool is a program which you can use to configure your network card. There is
+lots of stuff to configure here, but there are only three settings which really
+matter.
+
+Ethtool needs your network interface name for every operation. In this guide we
+will refer to your interface name as `$INTERFACE`. You can get your interface
+name from `ip a`.
+
+### Channels (ethtool -l)
+
+The channels param configures how many CPU cores will communicate with the NIC.
+You generally want this number to be equal to the number of CPU cores you have,
+that way the load will be evenly spread across your CPU. If you have more CPU
+cores than your NIC supports you can try turning multithreading off in the BIOS.
+Or just accept that only a portion of your cores will communicate with the NIC,
+it's not that big of a problem.
+
+If you are running on a multi-CPU platform you only want one CPU to communicate
+with the NIC. Distributing your channels over multiple CPUs will cause cache
+thrashing which absolutely tanks performance.
+
+Your NIC will usually configure the channels correctly on boot, so in most
+cases you don't need to change anything here. You can query the settings with
+`ethtool -l $INTERFACE` and update the values like this:
+`ethtool -L $INTERFACE combined 63`.
+
+### Ring buffers (ethtool -g)
+
+The ring buffers are queues where the NIC stores your IP packets before they are
+sent out to the network (tx) or sent to the CPU (rx). Increasing the ring buffer
+sizes can increase network latency a little bit because more packets are getting
+buffered before being sent out to the network. But again, at 100 GbE this
+happens so fast that the difference is in the order of microseconds, that makes
+absolutely no difference to us.
+
+If we can buffer more packets then it means we can transfer more data in bulk
+with every clock cycle. So we simply set this to the maximum. For Mellanox cards
+the maximum is usually `8192`, but this can vary. Check the maximum values for
+your card with `ethtool -g $INTERFACE`.
+
+`ethtool -G $INTERFACE rx 8192 tx 8192`
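+
+If you would rather not hardcode `8192`, the maximum can be read from the
+`Pre-set maximums` block that `ethtool -g` prints. A rough sketch, assuming the
+usual output layout with `RX:` and `TX:` lines before the `Current hardware
+settings` block. The helper name `ring_max` is made up for this example.
+
+```sh
+# ring_max RX|TX
+# Reads `ethtool -g` output on stdin and prints the pre-set maximum.
+ring_max() {
+    awk -v key="$1:" '
+        /^Current hardware settings/ { exit }   # stop before the current values
+        $1 == key { print $2; exit }            # "RX Mini:" does not match "RX:"
+    '
+}
+
+# Possible usage (needs root and a real interface):
+#   rx_max=$(ethtool -g "$INTERFACE" | ring_max RX)
+#   tx_max=$(ethtool -g "$INTERFACE" | ring_max TX)
+#   ethtool -G "$INTERFACE" rx "$rx_max" tx "$tx_max"
+```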
+
+### Interrupt Coalescing (ethtool -c)
+
+The NIC can't just write your packets to the CPU and expect it to do something
+with them. Your CPU needs to be made aware that there is new data to process.
+That happens with interrupts. Ethtool's interrupt coalescing values tell the NIC
+when and how to send interrupts to the CPU. This is a delicate balance. We don't
+want to interrupt the CPU too often, because then it won't be able to get any
+work done. That's like getting a new ping in team chat every half hour, how are
+you supposed to concentrate like that? But if we set the interrupt rate too
+slow, the NIC won't be able to send all packets in time.
+
+The interrupt coalescing options vary a lot per NIC type. These are the ones
+which are present on my ConnectX-6 Dx: `rx-usecs`, `rx-frames`, `tx-usecs`,
+`tx-frames`, `cqe-mode-rx`, `cqe-mode-tx`, `adaptive-rx`, `adaptive-tx`. I'll
+explain what these are one by one:
+
+ * `rx-usecs`, `tx-usecs`: These values dictate how often the NIC interrupts the
+   CPU to receive packets (`rx`) or send packets (`tx`). The value is in
+   microseconds. The SI prefix for micro is µ, but for convenience they use the
+   letter u here. A microsecond is one-millionth of a second.
+ * `rx-frames`, `tx-frames`: Like the values above these define how often the
+   CPU is interrupted, but instead of interrupting the CPU at fixed moments the
+   NIC interrupts the CPU when a certain number of packets is in the buffer.
+ * `cqe-mode-rx`, `cqe-mode-tx`: These options enable packet compression in the
+   PCI bus. This is handy if your PCI bus is overloaded. In most cases it's best
+   to leave this at the default value.
+ * `adaptive-rx`, `adaptive-tx`: These values tell the NIC to calculate its own
+   interrupt timings. This disregards the values we configure ourselves. The
+   timings calculated by the NIC often prefer low latency over throughput and
+   can quickly overwhelm the CPU with interrupts. So for our purposes this needs
+   to be disabled.
+
+So what are good values for these? Well, we can do some math here. Our NIC can
+send 100 Gigabits per second. That's 12.5 GB per second. A network packet is
+usually 1500 bytes. This means that we need to send 8333333 packets per second
+to reach full speed. Our ring buffer can hold 8192 packets, so if we divide
+that number again we learn that we need to send 1017 full ring buffers per
+second to reach full speed.
+
+Waiting for the ring buffer to be completely full is probably not a good idea,
+since then we can't add more packets until the previous packets have been copied
+out. So we want to empty the ring buffer twice as often, which leaves us with
+2034 ring buffers per second. Now convert that buffers per second number to µs
+per buffer: `1000000 / 2034 = 492µs`. We land on a value of 492µs per interrupt.
+This is our ceiling value. Higher than this and the buffers will overflow. But
+492µs is nearly half a millisecond, that's an eternity in CPU time. That's high
+enough that it might actually make a measurable difference in packet latency. So
+we opt for a safe value of 100µs instead. That still gives the CPU plenty of
+time to do other work in between interrupts, but is low enough to barely make a
+measurable difference in latency.
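+
+The same math as a small shell sketch, in case your link speed, packet size or
+ring buffer size is different. It uses integer arithmetic like the text above
+and rounds the last step to the nearest microsecond. The function name
+`coalesce_ceiling_us` is made up for this example.
+
+```sh
+# coalesce_ceiling_us LINK_GBIT PACKET_BYTES RING_SIZE
+# Prints the highest interrupt interval in microseconds that still lets the
+# ring buffer be emptied twice per fill at full line rate.
+coalesce_ceiling_us() {
+    bytes_per_second=$(( $1 * 1000000000 / 8 ))
+    packets_per_second=$(( bytes_per_second / $2 ))
+    rings_per_second=$(( packets_per_second / $3 ))
+    drains_per_second=$(( rings_per_second * 2 ))
+    # Integer division, rounded to the nearest microsecond.
+    echo $(( (1000000 + drains_per_second / 2) / drains_per_second ))
+}
+
+coalesce_ceiling_us 100 1500 8192   # prints 492
+```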
+
+As for the `{rx,tx}-frames` variables: we just spent all that time calculating
+the ideal interrupt interval, so I don't really want the NIC to start
+interrupting my CPU when it's not absolutely necessary. So we use the maximum
+ring buffer value here: `8192`. Your NIC might not support such high coalescing
+values. You can also try setting this to `4096` or `2048` if you notice
+problems.
+
+That leaves us with this configuration:
+
+```
+ethtool -C $INTERFACE adaptive-rx off adaptive-tx off \
+		rx-usecs 100 tx-usecs 100 \
+		rx-frames 8192 tx-frames 8192
+```
+
+Tip: If you want to see how much time your CPU is spending in handling
+interrupts, go into `htop`, then to Setup (F2) and enable "Detailed CPU time"
+under Display options. The CPU gauge will now show time spent on handling
+interrupts in purple. Press F10 to save changes.
+
+## BIOS
+
+Not even the BIOS is safe from our optimization journey. In fact, some of the
+most important optimizations must be configured here.
+
+### NUMA Nodes per socket
+
+Big CPUs with lots of cores often segment their memory into NUMA nodes. These
+smaller nodes can coordinate better with each other because they are close to
+each other. But one downside is that the segmentation causes performance
+problems with NIC queues. Because of this you always need to set `NUMA nodes per
+socket` to `NPS1`.
+
+Some AMD BIOSes also have an option called `ACPI SRAT L3 Cache as NUMA Domain`.
+This will create NUMA nodes based on the L3 cache topology, *even if you
+explicitly disabled NUMA in the memory addressing settings*. To fix this set
+`ACPI SRAT L3 Cache as NUMA Domain` to `Disabled`.
+
+### SMT Control
+
+Multithreading (or Hyperthreading, on Intel) can be a performance booster, but
+it can also be a performance bottleneck. If you have a CPU with a lot of cores,
+like AMD's Epyc lineup, then disabling SMT can be a good way to improve per-core
+performance.
+
+Most apps have no way to effectively use hundreds of CPU threads. At some point
+adding more threads will only consume more memory and CPU cycles, just because
+the kernel scheduler and memory controller have to manage all those threads. My
+rule of thumb: If you have more than 64 threads: `SMT OFF`
+
+### IOMMU
+
+The [Input-output memory management
+unit](https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit)
+is a CPU component for virtualizing your memory access. This can be useful if
+you run a lot of VMs for example. You know what it's also good for? **COMPLETELY
+DESTROYING NIC PERFORMANCE**.
+
+A high end NIC needs to shuffle a lot of data over the PCI bus. When the IOMMU
+is enabled that means that the data needs to be shuffled through the IOMMU first
+before it can go into memory. This adds some latency. When you are running a
+high end NIC in your PCI slot, then the added latency makes sure that your NIC
+will *never ever reach the advertised speed*. In some cases the overhead is so
+large that the NIC will effectively drop off the PCI bus, immediately crashing
+your system once it gets only slightly overloaded.
+
+Seriously, if you have a high end NIC plugged into your PCI slot and you have
+the IOMMU enabled, **you might as well plug a fucking brick into your PCI
+slot**, because that's about how useful your expensive NIC will be.
+
+It took me way too long to find this information. The difference between IOMMU
+off and on is night and day. I am actually **furious** that it took me this long
+to discover this. All the NIC tuning guides talk about is tweaking little
+ethtool params and shit like that, the IOMMU was completely omitted. I was
+getting so desperate with my terrible NIC performance that I just started
+flipping toggles in the BIOS to see if anything made a difference, that's how I
+discovered that the IOMMU was the source of **all my problems**.
+
+So yea... `AMD CBS > NBIO Common Options > IOMMU > Disabled` ...AND STAY DOWN!
+
+You can verify that your IOMMU is disabled with this command:
+`dmesg | grep iommu`. Your IOMMU is disabled if it prints something along the
+lines of:
+
+```
+[    1.302786] iommu: Default domain type: Translated
+[    1.302786] iommu: DMA domain TLB invalidation policy: lazy mode
+```
+
+If you see more output than that, you need to drop into the BIOS and nuke that
+shit immediately.
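+
+If you manage more than a few servers you may want to script that check. The
+sketch below is only a heuristic based on the description above: it treats two
+or fewer `iommu` lines in the kernel log as "disabled". The function name
+`iommu_looks_disabled` is made up for this example, and the threshold of two
+lines is an assumption taken from the sample output.
+
+```sh
+# iommu_looks_disabled
+# Reads kernel log text on stdin. Succeeds when at most two lines mention
+# "iommu", fails otherwise.
+iommu_looks_disabled() {
+    # grep -c exits non-zero on zero matches, so ignore its exit status.
+    count=$(grep -c 'iommu' || true)
+    [ "$count" -le 2 ]
+}
+
+# Possible usage (needs permission to read the kernel log):
+#   dmesg | iommu_looks_disabled && echo "IOMMU appears to be disabled"
+```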
+
+## Reverse proxy
+
+A lot of sites run behind a reverse proxy like nginx or Caddy. It seems to be an
+industry standard nowadays. People are surprised when they learn that pixeldrain
+does not use a web server like nginx or Caddy.
+
+Turns out that 100 Gigabit per second is a lot of data. It takes a considerable
+amount of CPU time to churn through that much data, so ideally you want to touch
+it as few times as you can. And when you are moving that much data the memcpy
+overhead really starts to show its true face. At this scale playing hot potato
+with your HTTP requests is a really bad idea.
+
+A big bottleneck with networking on Linux is copying data across the kernel
+boundary. The kernel always needs to copy your buffers because userspace is
+dirty, would not want to share memory with that. When you are running a reverse
+proxy every request is effectively crossing the kernel boundary *six times*.
+Let's assume we're running nginx here. The client sends a request to the server.
+The kernel copies the request body to nginx's listener (from kernel space to
+userspace), then nginx opens a request to your app and copies the body to the
+localhost TCP socket (back to kernel space). The kernel sends the body to your
+app's listener on localhost (now it's in userspace again). And then
+the response body follows the same path again. Request: NIC -> kernel ->
+userspace -> kernel -> userspace. Response: userspace -> kernel -> userspace ->
+kernel -> NIC. That's crazy inefficient.
+
+That's why pixeldrain just uses Go's built in HTTP server. Go's HTTP server is
+very complete. Everything you need is there:
+
+ * [Routing](https://github.com/julienschmidt/httprouter)
+ * [TLS (for HTTPS)](https://pkg.go.dev/crypto/tls)
+ * HTTP/2
+ * Even a [reverse
+   proxy](https://pkg.go.dev/net/http/httputil#NewSingleHostReverseProxy) if
+   you're into that kinda stuff
+
+The only requirement is that your app is written in Go. Of course other
+languages also have libraries for this.
+
+## Operating system
+
+Choose something up-to-date, light and minimalist. Pixeldrain used to run on
+Ubuntu because I was familiar with it, but after a while Ubuntu server got more
+bloated and heavy. Unnecessary stuff was being added with each new release
+(looking at you snapd), and I just didn't want to deal with that. Eventually I
+switched to Debian.
+
+Debian is much better than Ubuntu. After booting it for the first time there
+will only be like 10 processes running on the system, just the essentials. It
+really is a clean sandbox waiting for you to build a castle in it. It might take
+some getting used to, but it will definitely pay off.
+
+## Kernel
+
+You need to run at least kernel 6.1, because of the `net.ipv4.tcp_shrink_window`
+sysctl. But generally, **newer is better**. There are dozens of engineers from
+Google, Cloudflare and Meta tinkering away at the Linux network stack every day.
+It gets better with every release, really, the pace is staggering.
+
+But doesn't Debian ship really old kernel packages? (you might ask) Yes...
+kinda. By using [this guide](https://wiki.debian.org/HowToUpgradeKernel) you can
+upgrade your kernel version to the `testing` or even the `experimental` branch
+while keeping the rest of the OS the same.
+
+On the [Debian package tracker](https://tracker.debian.org/pkg/linux) you can
+see which kernel version ships in which repository. This is useful for picking
+which repo you want to use for your kernel updates. Pixeldrain gets its kernel
+updates from the `testing` branch. These are kernels which have been declared
+stable by the kernel developers and are generally safe to use.
+
+Keep an eye on the [Phoronix Linux Networking
+blog](https://www.phoronix.com/linux/Linux+Networking) for new kernel features.
+Pretty much every kernel version that comes out boasts about huge network
+performance wins. I'm personally waiting for Kernel 6.8 to come out. They are
+promising a 40% TCP performance boost. Crazy!