Add http2 / quic comparison

2024-03-07 15:22:06 +01:00
parent 7131071b91
commit 0c501a817a
4 changed files with 106 additions and 47 deletions


@@ -3,7 +3,7 @@
- [Introduction](#introduction)
- [Sysctls](#sysctls)
- [net.ipv4.tcp_congestion_control](#net-ipv4-tcp-congestion-control)
- [net.core.default_qdisc and txqueuelen](#net-core-default-qdisc-and-txqueuelen)
- [net.ipv4.tcp_shrink_window](#net-ipv4-tcp-shrink-window)
- [net.ipv4.tcp_{w,r}mem](#net-ipv4-tcp-w-r-mem)
- [net.ipv4.tcp_mem](#net-ipv4-tcp-mem)
@@ -17,9 +17,10 @@
- [SMT Control](#smt-control)
- [IOMMU](#iommu)
- [Reverse proxy](#reverse-proxy)
- [HTTP/2 or QUIC?](#http-2-or-quic)
- [Operating system](#operating-system)
- [Kernel](#kernel)
- [That's all, folks!](#that-s-all-folks)
## Introduction
@@ -38,12 +39,13 @@ paid good money for that 100 Gigabit connection and I'll be damned if I can't
use all of it. I'm getting to the bottom of this no matter how long it takes...
It took me well over a year to figure out all the details of high speed
networking. My good friend [Jeff
Brandt](https://www.linkedin.com/in/jeff-brandt-51b2a65/) (who has been hosting
pixeldrain for nearly ten years now) was able to point me in the right direction
by explaining the basics and showing me some sysctls and ethtool commands which
might affect performance. That was just the entrance of the rabbit hole though,
and it ran deep. After about a year of trial and error pixeldrain
can finally serve files at 100 Gigabit per second.
Below is a summary of everything I discovered during my year of reading NIC
manuals, digging through the kernel sources, running profilers, patching the
@@ -74,7 +76,7 @@ but it works for IPv6 as well.
`net.ipv4.tcp_congestion_control=bbr`
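Note that BBR is usually built as a kernel module. If the sysctl refuses to
apply, check whether it shows up in `sysctl
net.ipv4.tcp_available_congestion_control` and load it with `modprobe tcp_bbr`
first.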
### net.core.default_qdisc and txqueuelen
The qdisc (queuing discipline) is another param which gets mentioned often. The
qdisc orders packets which are queued so they can be sent in the most efficient
@@ -83,11 +85,18 @@ completely irrelevant, the network is rarely the bottleneck here.
Google used to require `fq` with `bbr`, but that requirement has been dropped. I
suggest you use something minimal and fast. How about `pfifo_fast`, it has fast
in the name, must be good, right? This is actually already the default on Linux
nowadays, so there's not really a need to change it.
`net.core.default_qdisc=pfifo_fast`
This option is only applied after a reboot.
A queue must have a size though. Linux gives the network queues a size of 1000
packets by default. As we'll learn later, a thousand packets is really not a lot
when running at 100 Gbps. When the queue is full the kernel will actually drop
packets, which is absolutely not what we want. So we increase the queue length
to 10000 packets instead:
`ip link set $INTERFACE txqueuelen 10000`
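You can verify both settings with `tc qdisc show dev $INTERFACE` and `ip link
show $INTERFACE` (the queue length shows up as `qlen 10000`).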
### net.ipv4.tcp_shrink_window
@@ -103,13 +112,17 @@ Cloudflare has an extensive writeup about the problem this sysctl solves here:
it](https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/)
This sysctl makes sure that TCP buffers are shrunk if they are larger than they
need to be. Without this sysctl your buffers will just continue to grow until
memory runs out! Before I discovered this patch my servers would regularly run
out of memory during peak load, and these are servers with a **terabyte of
RAM**! After applying the patches (and compiling the kernel, because the patches
were not merged yet back then) memory usage from TCP buffers was reduced by 80%
on my systems, and performance improved considerably. This patch is so crucial
for performance that it boggles my mind that it's not enabled by default. It's
even described in the [TCP
spec](https://www.rfc-editor.org/rfc/rfc7323#section-2.4); it's standardized
behaviour. If you're a kernel or systemd developer, please consider turning this
on by default instead of hiding it behind a toggle.
`net.ipv4.tcp_shrink_window=1`
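Note that this sysctl only exists on recent kernels (the patches were merged
around Linux 6.5), so on older kernels the setting simply isn't there.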
@@ -124,7 +137,7 @@ important for throughput.
### net.ipv4.tcp_{w,r}mem
These variables dictate how much memory can be allocated for your send and
receive buffers. The send and receive buffers are where TCP packets are stored
which are not yet acknowledged by the peer. The required size of these buffers
depends on your [Bandwidth-Delay Product
(BDP)](https://en.wikipedia.org/wiki/Bandwidth-delay_product). This concept is
@@ -150,7 +163,9 @@ a 260ms round trip.
The kernel also stores some other TCP-related stuff in that memory, and we also
need to account for packet loss which causes packets to be stored for a longer
time. I also don't want the speed to be limited to 10 Gbps; we're running a 100
GbE NIC after all. For this reason pixeldrain servers use a maximum buffer size
of 1 GiB.
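As a sanity check: a connection can never have more than one buffer's worth of
unacknowledged data in flight, so with a 260 ms round trip a 1 GiB buffer caps a
single connection at roughly 1 GiB / 0.26 s ≈ 4 GB/s, or about 33 Gbps.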
```
net.ipv4.tcp_wmem='4096 65536 1073741824'
@@ -160,7 +175,9 @@ net.core.rmem_max=1073741824
```
The three values in the wmem and rmem are the minimum buffer size, the default
buffer size and the maximum buffer size. The pixeldrain server application uses
64k reusable buffers (with [sync.Pool](https://pkg.go.dev/sync#Pool)) all over
the codebase. For this reason we initialize the window size at 64k as well.
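For illustration, this is roughly what that pattern looks like. It's a minimal
sketch, not pixeldrain's actual code; `copyChunk` is a made-up helper name:

```
package buffers

import (
	"io"
	"sync"
)

// bufPool hands out reusable 64 KiB buffers, so file transfers don't
// allocate (and garbage collect) a fresh buffer for every request.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 64*1024) },
}

// copyChunk streams src to dst through a pooled buffer and returns the
// buffer to the pool when done.
func copyChunk(dst io.Writer, src io.Reader) (int64, error) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	return io.CopyBuffer(dst, src, buf)
}
```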
### net.ipv4.tcp_mem
@@ -170,8 +187,9 @@ kernel still limits the TCP buffers globally.
This sysctl configures how much system memory can be used for TCP buffers. On
boot these values are set based on available system memory, which is good. But
by default it only uses like 5% of the memory, which is not even close to
enough. We need to pump those numbers way up to get anywhere near the speed that
we want.
tcp_mem is defined as three separate values. These values are in numbers of
memory pages. A memory page is usually 4096B. Here is what these three values mean:
@@ -206,10 +224,10 @@ I set these values dynamically per host with Ansible:
## Network Interface Cards
There are lots of NICs to choose from. From my testing every NIC seems to behave
differently. The only NIC types I have had any luck with are ConnectX-5 and
ConnectX-6. Intel's E810 NICs are also not terrible, but Nvidia cards seem to
fare much better with high connection counts. I currently have two servers with
E810 cards and two servers with ConnectX-6 cards. The E810 cards are usually the
first to crap out during a load peak. NICs are just fickle beasts overall. I
don't know if my experiences are actually related to the quality of the cards,
@@ -236,6 +254,10 @@ Ethtool needs your network interface name for every operation. In this guide we
will refer to your interface name as `$INTERFACE`. You can get your interface
name from `ip a`.
Ethtool options are not persistent across reboots, and there's no configuration
file to put them in either, so you'll need to apply them from a script that runs
somewhere in the boot process.
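One way to do that is a oneshot systemd unit. The unit below is only a sketch:
the unit name, interface name and ring sizes are placeholders, not pixeldrain's
actual configuration:

```
# /etc/systemd/system/nic-tune.service
[Unit]
Description=Apply NIC tuning at boot
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/ethtool -G eth0 rx 4096 tx 4096
ExecStart=/usr/sbin/ip link set eth0 txqueuelen 10000

[Install]
WantedBy=multi-user.target
```

Enable it once with `systemctl enable nic-tune.service` and the settings are
reapplied on every boot.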
### Channels (ethtool -l)
The channels param configures how many CPU cores will communicate with the NIC.
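You can see the current and maximum channel counts with `ethtool -l $INTERFACE`
and change them with `ethtool -L $INTERFACE combined 16` (the count here is just
an example).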
@@ -504,6 +526,35 @@ made](https://github.com/Fornaxian/zerodown). I use it for pixeldrain and it
works like a charm. Your software updates are just one `SIGHUP` away from being
deployed.
## HTTP/2 or QUIC?
HTTP/2 and QUIC (HTTP/3) are newer revisions of the HyperText Transfer
Protocol. HTTP/2 introduces multiplexing, which significantly reduces handshake
latency. Where HTTP/1.1 opens a separate TCP session for every file it needs to
request, HTTP/2 opens a single connection and uses framing to send multiple
requests at the same time. This allows the connection to ramp up to a higher
speed more quickly. It goes hand in hand with the BBR congestion control
algorithm, which also significantly reduces connection ramp-up time. The result
is on average 60% faster loading times for web pages.
HTTP/2 is trivially enabled in the Go HTTP server. Simply add `NextProtos =
[]string{"h2"}` to your `tls.Config` and it's good to go. An annoying
implementation detail is that Go's HTTP/2 server throws completely different
errors than HTTP/1.1, so you will have to redo all your error handling. To make
matters worse, HTTP/2's errors are not exported by the `http` package, so you
have to resort to string matching to catch them. 😒
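As a minimal sketch of the Go side (the certificate paths and handler are
placeholders):

```
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	srv := &http.Server{
		Addr: ":443",
		// Advertising "h2" over ALPN is all the standard library needs
		// to negotiate HTTP/2 on a TLS listener.
		TLSConfig: &tls.Config{NextProtos: []string{"h2", "http/1.1"}},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// r.Proto reports "HTTP/2.0" on multiplexed connections.
			w.Write([]byte("served over " + r.Proto + "\n"))
		}),
	}
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```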
Then along comes HTTP/3, also known as QUIC. HTTP/3 throws everything we just
did out of the window and uses UDP instead. It moves all the buffer management
and congestion control to userspace. Sure, you get more control that way, but
that's really only useful if you're Google. I tried the most popular HTTP/3
server implementation for Go, and it struggled to reach even half of the
throughput I got with HTTP/2. Latency is lower, but that's of little use to me
when the most important part of my site stops working. TCP is not perfect, but
it's better than having to do everything yourself.
To summarize, if you only care about throughput: HTTP/2 👍 HTTP/3 👎 (for now)
## Operating system
Choose something up-to-date, lightweight and minimalist. Pixeldrain used to run
@@ -548,6 +599,13 @@ promising a 40% TCP performance boost. Crazy!
## That's all, folks!
**Behold... One hundred gigabits per second!**
![nload showing 85 Gbps](/res/img/100gbps.webp)
Actually my nload seems to cap out at around 87 Gbps... there's probably some
overhead somewhere. It's close though.
I hope this guide was useful to you. I wish I had something like this when I
started out; it could quite literally have saved me months of time. Then again,
chasing 100 Gigabit is one of the most educational challenges I have ever faced. I
@@ -561,7 +619,9 @@ all.
Anyway, check out [Pixeldrain](/) if you like, it's the fastest way to transfer
files across the web. And I'm working on a [cloud storage](/filesystem) offering
as well. It has built-in rclone and FTPS support. Pixeldrain also has a built-in
[speedtest](/speedtest) which you can use to see the fruits of my labour. The
source for this document is available in markdown format on [my
GitHub](https://github.com/Fornaxian/pixeldrain_web/blob/master/res/include/md/100_gigabit_ethernet.md).
Follow me on [Mastodon](https://mastodon.social/@fornax),
[Twitter](https://twitter.com/Fornax96), join our

BIN res/static/img/100gbps.webp (new binary file, 7.0 KiB, not shown)

@@ -120,10 +120,21 @@ p>img {
max-width: 100%;
}
pre,
code {
display: inline-block;
background: var(--background);
border-radius: 5px;
margin: 0;
padding: 0 0.2em;
}
pre {
overflow-x: auto;
}
pre>code {
background: none;
padding: 0;
}
/* Page layout elements */
@@ -425,12 +436,6 @@ tr>th {
padding: 0.2em 0.5em;
}
/* API documentation markup */
.api_doc_details {


@@ -83,11 +83,11 @@ const measure_speed = (stop, test_duration) => {
// Updates per second
const ups = (1000/update_interval)
// This slice contains the speed measurements for three seconds of the test.
// This value is averaged and if the average is higher than the previously
// calculated average then it is saved. The resulting speed is the highest
// speed that was sustained for three seconds at any point in the test
const hist = new Uint32Array(ups*3)
let idx = 0
// This var measures for how many ticks the max speed has not changed. When
@@ -108,14 +108,8 @@ const measure_speed = (stop, test_duration) => {
idx++
// Calculate the average of all the speed measurements
const sum = hist.reduce((acc, val) => acc + val, 0)
const new_speed = (sum/hist.length)*ups
if (new_speed > speed) {
speed = new_speed
unchanged = 0