From 7131071b914a3c65485c90d6b6a83076f0807fab Mon Sep 17 00:00:00 2001
From: Wim Brand
Date: Thu, 7 Mar 2024 01:01:52 +0100
Subject: [PATCH] Expand the fast ethernet guide

---
 res/include/md/100_gigabit_ethernet.md | 265 +++++++++++++++++--------
 1 file changed, 183 insertions(+), 82 deletions(-)

diff --git a/res/include/md/100_gigabit_ethernet.md b/res/include/md/100_gigabit_ethernet.md
index 68e3f05..de696fb 100644
--- a/res/include/md/100_gigabit_ethernet.md
+++ b/res/include/md/100_gigabit_ethernet.md
@@ -19,11 +19,13 @@
 - [Reverse proxy](#reverse-proxy)
 - [Operating system](#operating-system)
 - [Kernel](#kernel)
+- [That's all, folks!](#thats-all-folks)
 
 ## Introduction
 
-You might have just downloaded a 10 GB file in 20 seconds and wondered how that
-is even possible. Well, it took a lot of effort to get there.
+If you are one of the lucky few who has a fast enough connection, you might have
+just downloaded a 5 GB file in 10 seconds and wondered how that is even
+possible. Well, it took a lot of effort to get there.
 
 When I first ordered a 100 GbE server I expected things to just work. Imagine my
 surprise when the server crashed when serving at just 20 Gigabit.
@@ -31,13 +33,21 @@ surprise when the server crashed when serving at just 20 Gigabit.
 Then I expected my server host to be able to help with the performance problems.
 Spoiler alert: They could not help me.
 
-That's where my journey into the rabbit hole of network performance started. It
-took me just about a year to figure out all the details of high speed
-networking, and now pixeldrain can finally serve files at 100 Gigabit per
+That's where my journey into the rabbit hole of network performance started. I
+paid good money for that 100 Gigabit connection and I'll be damned if I can't
+use all of it. I'm getting to the bottom of this no matter how long it takes...
+
+It took me well over a year to figure out all the details of high speed
+networking. My good friend Jeff (who has been hosting pixeldrain for nearly ten
+years now) was able to point me in the right direction by showing me some
+sysctls and ethtool commands which might affect performance. That was just the
+entrance of the rabbit hole though, and this one carried on deep. After about a
+year of trial and error pixeldrain can finally serve files at 100 Gigabit per
 second.
 
 Below is a summary of everything I discovered during my year of reading NIC
-manuals, digging through the kernel sources and patching the kernel.
+manuals, digging through the kernel sources, running profilers, patching the
+kernel, and learning about CPU topology and TCP inner workings.
 
 ## Sysctls
 
@@ -170,8 +180,9 @@ memory pages. A memory page is usually 4096B. Here is what these three values me
   limited.
 * `pressure`: When the TCP memory usage exceeds this threshold it will try to
   shrink some TCP buffers to free up memory. It will keep doing this until
-  memory usage drops below `low` again. You don't want to set `low` and
-  `pressure` too far apart.
+  memory usage drops below `low` again. Shrinking TCP buffers takes a lot of
+  CPU time, and during this time no data is sent to the client. You don't want
+  to set `low` and `pressure` too far apart.
 * `high`: The TCP system can't allocate more than this number of pages. If this
   limit is reached and a new TCP session is opened it will not be able to
   allocate any memory. Needless to say this is terrible for performance.
@@ -197,14 +208,23 @@ I set these values dynamically per host with Ansible:
 
 There are lots of NICs to choose from. From my testing there are a lot of bad
 apples in the bunch. The only NIC types I have had any luck with are ConnectX-5
-and ConnectX-6.
+and ConnectX-6. Intel's E810 NICs are also not terrible, but Nvidia cards seem
+to fare better with high connection counts. I currently have two servers with
+E810 cards and two servers with ConnectX-6 cards. The E810 cards are usually the
+first to crap out during a load peak. NICs are just fickle beasts overall. I
+don't know if my experiences are actually related to the quality of the cards,
+or just bad luck with faulty hardware.
 
 Often you see advice to install a proprietary driver for your NIC. Don't do
 that. In my experience that has only caused problems. Nvidia's NIC drivers are
 just as shitty as their video drivers. They will break kernel updates and
-generally make your life miserable.
+generally make your life miserable. The drivers in the Linux kernel are good and
+well maintained. You don't need to taint your kernel with some scary proprietary
+blob.
 
-Upgrading the firmware for your NIC can be a good idea.
+Upgrading the firmware for your NIC can be a good idea, if you can figure out
+how, that is. Nvidia's tools for upgrading firmware are a huge hassle to work
+with and the documentation is outdated and scarce.
 
 ## ethtool
 
@@ -227,7 +247,10 @@ it's not that big of a problem.
 
 If you are running on a multi-CPU platform you only want one CPU to communicate
 with the NIC. Distributing your channels over multiple CPUs will cause cache
-thrashing which absolutely tanks performance.
+thrashing which absolutely tanks performance. Many of pixeldrain's servers are
+dual CPU, where one CPU runs the pixeldrain software and the other only
+communicates with the NIC. Buying a $10k CPU just to talk to a NIC is a bit
+wasteful, so I recommend just using one CPU if you have the choice.
 
 Your NIC will usually configure the channels correctly on boot, so in most of
 the cases you don't need to change anything here. You can query the settings
@@ -236,46 +259,49 @@ $INTERFACE combined 63`.
 
 ### Ring buffers (ethtool -g)
 
-The ring buffers are queues where the NIC stores your IP packets before they are
-sent out to the network (tx) or sent to the CPU (rx). Increasing the ring buffer
-sizes can increase network latency a little bit because more packets are getting
-buffered before being sent out to the network. But again, at 100 GbE this
-happens so fast that the difference is in the order of microseconds, that makes
-absolutely no difference to us.
+The ring buffers are portions of RAM where the NIC stores your IP packets before
+they are sent out to the network (tx) or sent to the CPU (rx). Increasing the
+ring buffer sizes can increase network latency a little bit because more packets
+are getting buffered before being sent out to the network. But again, at 100 GbE
+this happens so fast that the difference is in the order of microseconds, which
+makes absolutely no difference to us. We just want to move as much data as
+possible in as little time as possible.
 
 If we can buffer more packets then it means we can transfer more data in bulk
 with every clock cycle. So we simply set this to the maximum. For Mellanox cards
 the maximum is usually `8192`, but this can vary. Check the maximum values for
 your card with `ethtool -g $INTERFACE`.
 
-`ethtool -G $INTERFACE rx 8192 tx 8192`
+Set the receive and send buffers to 8192 packets: `ethtool -G $INTERFACE rx 8192
+tx 8192`
 
 ### Interrupt Coalescing (ethtool -c)
 
 The NIC can't just write your packets to the CPU and expect it to do something
 with them. Your CPU needs to be made aware that there is new data to process.
-That happens with interrupts. Ethtool's interrupt coalescing values tell the NIC
-when and how to send interrupts to the CPU. This is a delicate balance. We don't
-want to interrupt the CPU too often, because then it won't be able to get any
-work done. That's like getting a new ping in team chat every half hour, how are
-you supposed to concentrate like that? But if we set the interrupt rate too
-slow, the NIC won't be able to send all packets in time.
+That happens with an interrupt. Ethtool's interrupt coalescing values tell the
+NIC when and how to send interrupts to the CPU. This is a delicate balance. We
+don't want to interrupt the CPU too often, because then it won't be able to get
+any work done. That's like getting a new ping in team chat every half hour, how
+are you supposed to concentrate like that? But if we set the interrupt rate too
+slow, the NIC won't be able to send all packets in time before the buffers fill
+up.
 
 The interrupt coalescing options vary a lot per NIC type.. These are the ones
 which are present on my ConnectX-6 Dx: `rx-usecs`, `rx-frames`, `tx-usecs`,
-`tx-frames`, `cqe-mode-rx`, `cqe-mode-tx`. I'll explain what these are one by
-one:
+`tx-frames`, `cqe-mode-rx`, `cqe-mode-tx`. I'll explain what these are:
 
 * `rx-usecs`, `tx-usecs`: These values dictate how often the NIC interrupts the
   CPU to receive packets `rx` or send packets `tx`. The value is in
   microseconds. The SI prefix for micro is µ, but for convenience they use the
   letter u here. A microsecond is one-millionth of a second.
 * `rx-frames`, `tx-frames`: Like the values above this defines how often the
-  CPU is interrupted, but instead of interrupting the CPU at fixed moments it
-  interrupts the CPU when a certain number of packets is in the buffer.
+  CPU is interrupted, but instead of interrupting the CPU at a fixed interval
+  it interrupts the CPU when a certain number of packets are in the buffer.
* `cqe-mode-rx`, `cqe-mode-tx`: These options enable packet compression in the
-  PCI bus. This is handy if your PCI bus is overloaded. In most cases it's best
-  to leave this at the default value.
+  PCI bus. This is handy if your PCI bus is a bottleneck, like when your 100G
+  NIC is plugged into 4x PCI 4.0 lanes, which only have 7.88 GB/s bandwidth. In
+  most cases it's best to leave these at the default value.
 * `adaptive-rx`, `adaptive-tx`: These values tell the NIC to calculate its own
   interrupt timings. This disregards the values we configure ourselves. The
   timings calculated by the NIC often prefer low latency over throughput and
@@ -285,21 +311,23 @@ one:
 
 So what are good values for these? Well, we can do some math here. Our NIC can
 send 100 Gigabits per second. That's 12.5 GB. A network packet is usually 1500
 bytes. This means that we need to send 8333333 packets per second to reach full
-speed. Our ring buffer can hold 8192 packets, so if we divide that number again
-we learn that we need to send 1017 full ring buffers per second to reach full
+speed. Our ring buffer can hold 8192 packets, so if we divide by that number we
+learn that we need to send 1017 entire ring buffers per second to reach full
 speed.
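+
+To sanity check those numbers before we turn them into an interrupt interval,
+here is the same arithmetic as a tiny Go program. Just a sketch, using the 1500
+byte packet size and 8192 packet ring size assumed above:
+
+```
+package main
+
+import "fmt"
+
+func main() {
+    const (
+        linkBits = 100e9 // 100 Gigabit per second
+        pktSize  = 1500  // bytes per packet, roughly one MTU
+        ringSize = 8192  // packets per ring buffer
+    )
+
+    packetsPerSec := linkBits / 8 / pktSize // ~8.3 million packets per second
+    ringsPerSec := packetsPerSec / ringSize // ~1017 full ring buffers per second
+
+    fmt.Printf("%.0f packets/s, %.0f full ring buffers/s\n",
+        packetsPerSec, ringsPerSec)
+    // The next step in the text: only let each buffer fill halfway, then
+    // convert the result into an interrupt interval in microseconds.
+}
+```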
 
 Waiting for the ring buffer to be completely full is probably not a good idea,
 since then we can't add more packets until the previous packets have been copied
 out. So we want to be able to empty the ring buffer twice. That leaves us with
-ring buffers per second. Now convert that buffers per second number to µs per
-buffer: `1000000 / 2034 = 492µs`, we land on a value of 492µs per interrupt.
+2034 ring buffers per second. Now convert that buffers per second number to µs
+per buffer: `1000000 / 2034 = 492µs`, we land on a value of 492µs per interrupt.
 
 This is our ceiling value. Higher than this and the buffers will overflow. But
 492µs is nearly half a millisecond, that's an eternity in CPU time. That's high
 enough that it might actually make a measurable difference in packet latency. So
-we opt for a safe value of 100µs instead. That still gives the CPU plenty of
-time to do other work in between interrupts, but is low enough to barely make a
-measurable difference in latency.
+we opt for a more sane value of 100µs instead. That still gives the CPU plenty
+of time to do other work in between interrupts. A 3 GHz CPU core gets through
+about 300000 clock cycles in between each interrupt. At the same time it's low
+enough to barely make a measurable difference in latency, at most a tenth of a
+millisecond.
 
 As for the `{rx,tx}-frames` variables. We just spent all that time calculating
 the ideal interrupt interval, I don't really want the NIC to start interrupting
@@ -315,7 +343,7 @@ ethtool -C $INTERFACE adaptive-rx off adaptive-tx off \
   rx-frames 8192 tx-frames 8192
 ```
 
-Tip: If you want to see how much time your CPU is spending in handling
+Tip: If you want to see how much time your CPU is spending on handling
 interrupts, go into `htop`, then to Setup (F2) and enable "Detailed CPU time"
 under Display options. The CPU gauge will now show time spent on handling
 interrupts in purple. Press F10 to save changes.
@@ -328,10 +356,13 @@ most important optimizations must be configured here.
 ### NUMA Nodes per socket
 
 Big CPUs with lots of cores often segment their memory into NUMA nodes. These
-smaller nodes can coordinate better with each other because they are close to
-each other. But one downside is that the segmentation causes performance
-problems with NIC queues. Because of this you always need to set `NUMA nodes per
-socket` to `NPS1`.
+smaller nodes get exclusive access to a certain portion of RAM and don't have to
+contend over memory access with the other NUMA nodes. This can improve your
+performance... if your software supports it well. But from my testing the setup
+of one NIC queue per core does not combine well with having multiple NUMA nodes.
+The fact that I use Go, which does not have a NUMA aware scheduler as far as I
+know, probably does not help either. For these reasons I prefer to set `NUMA
+nodes per socket` to `NPS1`.
 
 Some AMD BIOSes also have an option called `ACPI SRAT L3 Cache as NUMA Domain`.
 This will create NUMA nodes based on the L3 cache topology, *even if you
@@ -347,39 +378,56 @@ performance.
 
 Most apps have no way to effectively use hundreds of CPU threads. At some point
 adding more threads will only consume more memory and CPU cycles just because
-they kernel scheduler and memory controller has to manage all those threads. My
-rule of thumb: If you have more than 64 threads: `SMT OFF`
+the kernel scheduler, memory controller and your language runtime have to
+manage all those threads. This can cause huge amounts of overhead. My rule of
+thumb: If you have 64 or more cores: `SMT OFF`
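+
+As an aside: if your app is written in Go like mine, you can also cap the number
+of threads that actively run your code from inside the program, instead of (or
+on top of) flipping SMT in the BIOS. Nothing in this guide depends on it, this is
+just a small sketch of the same idea, with 64 as an arbitrary example value:
+
+```
+package main
+
+import (
+    "fmt"
+    "runtime"
+)
+
+func main() {
+    // Limit how many OS threads execute Go code at the same time.
+    // GOMAXPROCS(0) only reports the current setting without changing it.
+    previous := runtime.GOMAXPROCS(64)
+    fmt.Printf("GOMAXPROCS: %d -> %d\n", previous, runtime.GOMAXPROCS(0))
+}
+```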
 
 ### IOMMU
 
 The [Input-output memory management
 unit](https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit)
 is a CPU component for virtualizing your memory access. This can be useful if
-you run a lot of VMs for example. You know what it's also good for? **COMPLETELY
-DESTROYING NIC PERFORMANCE**.
+you run a lot of VMs for example. You know what it's also good for? **Completely
+destroying NIC performance**.
 
-A high end NIC needs to shuffle a lot of data over the PCI bus. When the IOMMU
-is enabled that means that the data needs to be shuffled through the IOMMU first
-before it can go into memory. This adds some latency. When you are running a
+A high end NIC needs to shuffle a lot of data over the PCI bus. A 100 GbE NIC in
+full duplex can reach up to 25 GB/s! When the IOMMU is enabled it means that all
+the data that the NIC sends/receives needs to go through the IOMMU first before
+it can go into RAM. This adds a little bit of latency. When you are running a
 high end NIC in your PCI slot, then the added latency makes sure that your NIC
-will *never ever reach the advertised speed*. In some cases the overhead is so
-large that the NIC will effectively drop off the PCI bus, immediately crashing
-your system once it gets only slightly overloaded.
+will **never ever get anywhere near the advertised speed**. In some cases the
+overhead is so large that the NIC will effectively drop off the PCI bus,
+immediately crashing your system once it gets only slightly overloaded. Yes,
+really, I have seen this happen.
 
 Seriously, if you have a high end NIC plugged into your PCI slot and you have
-the IOMMU enabled. **You might as well plug a fucking brick into your PCI
-slot**, because that's about how useful your expensive NIC will be.
+the IOMMU enabled. **You might as well plug a goddamn brick into your PCI
+slot**, because that's about as useful as your expensive NIC will be.
 
 It took me way too long to find this information. The difference between IOMMU
-off and on is night and day. I am actually **furious** that it took me this long
-to discover this. All the NIC tuning guides talk about is tweaking little
-ethtool params and shit like that, the IOMMU was completely omitted. I was
-getting so desperate with my terrible NIC performance that I just started
-flipping toggles in the BIOS to see if anything made a difference, that's how I
-discovered that the IOMMU was the source of **all my problems**.
+off and on is night and day. I am actually **furious** that it took me so long
+to discover this. I spent *weeks* pulling hair out of my head trying to figure
+out why my NIC was locking up whenever I tried to put any real load on it. All
+the NIC tuning guides I could find talk about tweaking little ethtool params,
+installing drivers, updating firmware and useless crap like that, the IOMMU was
+completely omitted in every one of them. I was getting so desperate with my
+terrible NIC performance that I just started flipping toggles in the BIOS to see
+if anything made a difference. If you have any idea how long it takes to reboot
+a high end server system you know how tedious this is. That's how I discovered
+that the IOMMU was the source of **all my problems**.
+
+Ugh, just thinking about all the time I wasted because nobody told me to just
+turn the IOMMU off gets my blood boiling. That's why I am writing this guide, I
+want to spare you the suffering. So yea...
 
 `AMD CBS > NBIO Common Options > IOMMU > Disabled` ...AND STAY DOWN!
 
+I also just turn off anything related to virtualization nowadays. Having
+virtualization options enabled when you are not running VMs is a waste of
+resources. No worries, docker is not virtualization, it's just namespacing,
+nothing virtual about that. And if you are running VMs... well, consider bare
+metal. It's really not that scary and there is lots of performance to be gained.
+
 You can verify that your IOMMU is disabled with this command `dmesg | grep
 iommu`. Your IOMMU is disabled if it prints something along the lines of:
 
@@ -391,30 +439,34 @@ iommu`. Your IOMMU is disabled if it prints something along the lines of:
 If you see more output than that, you need to drop into the BIOS and nuke that
 shit immediately.
 
+One little caveat is that Linux requires the IOMMU in order to support more than
+255 CPU threads. So if you have 256 threads and the IOMMU is turned off, one of
+your threads will be disabled. So once again I will repeat my rule of thumb with
+regards to multithreading: If you have 64 or more cores: `SMT OFF`
+
 ## Reverse proxy
 
 A lot of sites run behind a reverse proxy like nginx or Caddy. It seems to be an
 industry standard nowadays. People are surprised when they learn that pixeldrain
-does not use a web server like nginx or Caddy.
+does not use one of the standard web servers.
 
-Turns out that 100 Gigabit per second is a lot of data. It takes a considerable
-amount of CPU time to churn through that much data, so ideally you want to touch
-it as few times as you can. And when you are moving that much data the memcpy
-overhead really starts to show its true face. At this scale playing hot potato
-with your HTTP requests is a really bad idea.
+As it turns out, 100 Gigabit per second is a lot of data. It takes a
+considerable amount of CPU time to churn through that much data, so ideally you
+want to touch it as few times as you can. At this scale playing hot potato with
+your HTTP requests is a really bad idea.
 
 A big bottleneck with networking on Linux is copying data across the kernel
 boundary. The kernel always needs to copy your buffers because userspace is
-dirty, would not want to share memory with that. When you are running a reverse
-proxy every request is effectively crossing the kernel boundary *six times*.
-Let's assume we're running nginx here, the client sends a request to the server.
-The kernel copies the request body from kernel space to nginx's listener (from
-kernel space to userspace), nginx opens a request to your app and copies the
-body the to localhost TCP socket (back to kernel space). The kernel sends the
-body to your app's listener on localhost (now it's in userspace again). And then
-the response body follows the same path again. Request: NIC -> kernel ->
-userspace -> kernel -> userspace. Response: userspace -> kernel -> userspace ->
-kernel -> NIC. That's crazy inefficient.
+dirty, ew, would not want to share memory with that. When you are running a
+reverse proxy every HTTP request is effectively crossing the kernel boundary
+*six times*. Let's assume we're running nginx here, the client sends a request
+to the server. The kernel copies the request body from kernel space to nginx's
+listener (from kernel space to userspace), nginx opens a request to your app and
+copies the body to the localhost TCP socket (back to kernel space). The kernel
+sends the body to your app's listener on localhost (now it's in userspace
+again). And then the response body follows the same path again. Request: NIC ->
+kernel -> userspace -> kernel -> userspace. Response: userspace -> kernel ->
+userspace -> kernel -> NIC. That's crazy inefficient.
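+
+To make that concrete, here is roughly the shape of the approach described in
+the next paragraph: serving files straight from your own Go process. This is not
+pixeldrain's actual code, just a minimal sketch with a made up port and
+directory, but note that there is no proxy hop anywhere:
+
+```
+package main
+
+import (
+    "log"
+    "net/http"
+)
+
+func main() {
+    // The handler talks directly to the kernel's TCP socket. As far as I
+    // know ServeFile will even use sendfile on Linux where possible, so
+    // file data does not have to pass through userspace at all.
+    http.HandleFunc("/file/", func(w http.ResponseWriter, r *http.Request) {
+        name := "/srv/files/" + r.URL.Path[len("/file/"):]
+        // ServeFile handles range requests and conditional requests, and
+        // rejects paths containing "..".
+        http.ServeFile(w, r, name)
+    })
+    log.Fatal(http.ListenAndServe(":8080", nil))
+}
+```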
 
 That's why pixeldrain just uses Go's built in HTTP server. Go's HTTP server is
 very complete. Everything you need is there:
@@ -429,19 +481,47 @@ very complete. Everything you need is there:
 The only requirement is that your app is written in Go. Of course other
 languages also have libraries for this.
 
+Zero-downtime restarts are a bit tricky. Luckily the geniuses tinkering away at
+the Linux kernel every day made something neat for us. It's called
+`SO_REUSEPORT` (Wow! Catchy name!). By putting this socket option on your TCP
+listener you allow future instances of your server process to listen on the same
+port at the same time. By doing this your upgrades become really quite simple:
+
+1. Upload the new server executable to the server.
+2. Start the new executable up.
+3. When everything is initialized it starts listening on the same port as the
+   previous process using `SO_REUSEPORT`.
+4. After the listener is installed we signal to the old server process (which is
+   still running at this point) that it can start shutting down. Its listener is
+   closed and the active HTTP requests are gracefully completed.
+5. Once the old listener is closed all new requests will go to the new process
+   and the upgrade is complete.
+
+Now there may be one question on your mind: How do I signal to the previous
+process that the new process has finished initializing? I have just the thing
+for you: [this handy-dandy library that I
+made](https://github.com/Fornaxian/zerodown). I use it for pixeldrain and it
+works like a charm. Your software updates are just one `SIGHUP` away from being
+deployed.
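+
+For reference, here is a minimal sketch of what a listener with `SO_REUSEPORT`
+looks like in Go. This is not pixeldrain's code and not the zerodown API, just
+the socket option itself. It assumes the golang.org/x/sys/unix package and a
+made up port:
+
+```
+package main
+
+import (
+    "context"
+    "log"
+    "net"
+    "net/http"
+    "syscall"
+
+    "golang.org/x/sys/unix"
+)
+
+func listenReusePort(addr string) (net.Listener, error) {
+    lc := net.ListenConfig{
+        Control: func(network, address string, c syscall.RawConn) error {
+            var sockErr error
+            err := c.Control(func(fd uintptr) {
+                // With SO_REUSEPORT a newer instance of this process
+                // can bind the same port while this one is still
+                // draining its requests.
+                sockErr = unix.SetsockoptInt(int(fd),
+                    unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
+            })
+            if err != nil {
+                return err
+            }
+            return sockErr
+        },
+    }
+    return lc.Listen(context.Background(), "tcp", addr)
+}
+
+func main() {
+    ln, err := listenReusePort(":8080")
+    if err != nil {
+        log.Fatal(err)
+    }
+    // Serve on the shared port. The old process closes its own listener
+    // once the new one signals that it is ready. Passing nil here means
+    // DefaultServeMux, plug in your real handler instead.
+    log.Fatal(http.Serve(ln, nil))
+}
+```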
 
 ## Operating system
 
-Choose something up-to-date, light and minimalist. Pixeldrain used to run on
-Ubuntu because I was familiar with it, but after a while Ubuntu server got more
+Choose something up-to-date, lightweight and minimalist. Pixeldrain used to run
+on Ubuntu because I was familiar with it, but over time Ubuntu server got more
 bloated and heavy. Unnecessary stuff was being added with each new release
 (looking at you snapd), and I just didn't want to deal with that. Eventually I
 switched to Debian.
 
-Debian is much better than Ubuntu. After booting it for the first time there
+Debian is so much better than Ubuntu. After booting it for the first time there
 will only be like 10 processes running on the system, just the essentials. It
 really is a clean sandbox waiting for you to build a castle in it. It might take
 some getting used to, but it will definitely pay off.
 
+Anyway, that's just my opinion. In reality you can pick any distro you like. It
+does not really matter that much. Just keep in mind that some distros ship newer
+kernels than others, and that's really quite important as we will learn in the
+next paragraph.
+
 ## Kernel
 
 You need to run at least kernel 6.1, because of the `net.ipv4.tcp_shrink_window`
@@ -449,8 +529,8 @@ sysctl. But generally, **newer is better**. There are dozens of engineers from
 Google, Cloudflare and Meta tinkering away at the Linux network stack every day.
 It gets better with every release, really, the pace is staggering.
 
-But doesn't Debian ship really old kernel packages? (you might ask) Yes...
-kinda. By using [this guide](https://wiki.debian.org/HowToUpgradeKernel) you can
+But doesn't Debian ship quite old kernel packages? (you might ask) Yes... kinda.
+By using [this guide](https://wiki.debian.org/HowToUpgradeKernel) you can
 upgrade your kernel version to the `testing` or even the `experimental` branch
 while keeping the rest of the OS the same.
 
@@ -465,3 +545,24 @@ blog](https://www.phoronix.com/linux/Linux+Networking) for new kernel features.
 Pretty much every kernel version that comes out boasts about huge network
 performance wins. I'm personally waiting for Kernel 6.8 to come out. They are
 promising a 40% TCP performance boost. Crazy!
+
+## That's all, folks!
+
+I hope this guide was useful to you. I wish I had something like this when I
+started out. It could have quite literally saved me months of time. Then again,
+chasing 100 Gigabit is one of the most educational challenges I have ever faced.
+I have learned so much about Linux's structure, kernel performance profiling,
+CPU architecture, the PCIe bus and tons of other things that I would never have
+known if I did not go down this rabbit hole. And I have a feeling the journey is
+not over. I will always have this urge to get the absolute most out of my
+servers. I'm paying for the whole CPU and I'm going to use the whole CPU after
+all.
+
+Anyway, check out [Pixeldrain](/) if you like, it's the fastest way to transfer
+files across the web. And I'm working on a [cloud storage](/filesystem) offering
+as well. It has built in rclone and FTPS support. Pixeldrain also has a built in
+[speedtest](/speedtest) which you can use to see the fruits of my labour.
+
+Follow me on [Mastodon](https://mastodon.social/@fornax),
+[Twitter](https://twitter.com/Fornax96), or join our
+[Discord](https://discord.gg/pixeldrain), et cetera et cetera.
\ No newline at end of file