35 Million Hot Dogs: Benchmarking Caddy vs. Nginx

This blog post is about benchmarking Caddy against Nginx and their respective performance metrics as reverse proxies. Be forewarned: I was very thorough and there are tons of graphs and tables in here. I didn’t want to make any mistakes! Nobody is allowed to make mistakes on the Internet.

You can proceed to the results if you want to, but you’ll hurt my vain writer’s feelings if you skip past the methodology. I swear I won’t use the term “web-scale” once.

Also, I worked in a gimmicky title because I’m sending the X-Hotdogs: aplenty header to each proxy and I totaled 35,127,259 requests over the course of these tests. I pay myself in bad jokes when I’m not on the clock (my original post title was a pun about hot dog eating contests).

Finally, a shout-out to my clients who are facilitating my freelance work that ultimately makes things like this possible. If it weren’t for people paying me for flexible hours and projects, I’d probably be spending 9-5 rearranging kanban cards. Y’all are the real heroes (you know who you are).

Background

For some reason I’ve spent an inordinate amount of time in my career working with reverse proxies. Apache, Nginx, traefik, various kubernetes services, and lately, Caddy. I don’t think it’s just recency bias to say that Caddy has been my favorite: simple, modern, and quick, it does what a modern reverse proxy should do without much hassle.

However, some people (understandably!) hesitate when confronted with a choice between the current reigning champion, Nginx, and a relative newcomer like Caddy. Nginx has been around for a long time and is very good at what it does. Caddy is new to the scene and makes tradeoffs, for better or worse - the C vs. golang authorship is just one example that is bound to result in some sort of performance delta.

So, I’m here to do some “““science””” on the performance differences between Nginx and Caddy. I’ve already confessed to you that I’m a Caddy fanboy, but I’ll commit to being as impartial as I can when it comes to crunching the numbers. I’m not being paid by either Nginx or Caddy for this! (Although they’re welcome to if they want – I’m a starving freelancer.)

On Benchmarking
Image via DALL·E

Before I get into the meat: benchmarking is really, really hard. There are so many potential variables that come into play and we’re operating so high up on the computer science abstraction stack that it’s easy to make mistakes. I spent a lot of time thinking about how to do this in a fair and reasonable way.

At Elastic I worked with two engineers who may be some of the best performance engineers I’ve ever personally encountered, Daniel Mitterdorfer and Dimitrios Liappis. Their work on Rally really stuck with me, particularly in regard to how careful and methodical they were when running benchmarks. I drew ideas from a few different places when figuring out how to run these tests, but much of the rigor here came from the real professionals like them who share how to do benchmarking well. This professional shout-out is just a nod of respect and not an attempt to elicit favors from anybody (although Daniel did bring me chocolate to a company all-hands one time – and it was Austrian chocolate, too).

Regarding versions, note that Nginx offers a commercial variant of their software - Nginx Plus. I didn’t explore this option too heavily because Caddy is completely open source and doesn’t have any commercial features to fall back on, so we’ll compare these proxies on equal OSS footing.

Finally, if people have concerns about the data or want to collaborate, please do so! I’ll include notes here about how to reproduce my data. The test setup and load testing harness are all very automated so duplicating my results should be very, very achievable if you’d like to try any of this yourself.

(One more note – I’m bound to mix terms here and probably use “latency”, “time to first byte”, and more incorrectly. I think I got pretty close, but bear with me, this is my first time cosplaying as someone who knows anything about benchmarking.)

Design

We’ll only be testing Caddy and Nginx today, but I’ll be cutting along many different axes to understand their respective performance profiles as accurately as possible.

The tests that I’ll perform will focus on three areas:

  1. Purely in-proxy “synthetic” responses. I’m calling them “synthetic” because the proxy forms a response code and body solely within its own configuration: while processing the response, it does not retrieve any assets from disk and does not talk to a backend listening service. This isolates all responsibility for responding to requests to the reverse proxy itself and no other dependent resources.
  2. Serving files. Reverse proxies serve up precompiled static assets all the time, so this is an important use case to measure. It’s less pure than the synthetic tests since we’re now involving disk I/O, but remember that the reverse proxy can lean on caching, too.
    • Note that I’ll be cutting across this axis in two ways: with small responses – think a minimal index.html – and a large response body. In my tests I used jquery-3.6.1.js which clocks in at 109K and my index file is 4.5K according to du.
  3. Reverse proxying. This is a reverse proxy’s bread and butter (hopefully that’s obvious): chatting with a service listening on the backend side of the service. Although this is super important to measure, it’s tricky to do because now you’re introducing an entirely separate process that may impact response times. In practice, your backend will almost always be your bottleneck because the reverse proxy’s job is just to hand traffic along. In that regard I’m not a huge fan of this test because the performance of the backend target ends up bleeding into the numbers, but we need to do it anyway (because the “open a new socket” testing path needs to happen). Hopefully, all else equal, the deltas between different test runs will still be significant even if the absolute values aren’t strictly representative.

    I’ll be setting up a bare-bones lighttpd listener that itself responds with a synthetic response for quick turnaround. A slower backend service may be worth testing in the future so that I could measure what lots of open and waiting backend connections behave like, but we’re already working with a lot of variables.

Those variables that I’ll change will be:

  • Concurrent requests. In practice, many operators/sites will very seldom see massive concurrent load, so I’m keen on hitting a few watermarks:
    • Low, casual use. What is performance like when the service isn’t under duress?
    • High - but not error-inducing - levels of load. What kind of sustained performance can we achieve without sacrificing reliability?
    • Service-breaking load (stress testing). What kind of service degradation do we observe when traffic levels exceed the capabilities of a listening proxy? Does the service queue up responses and incur longer response times? Or does it drop connections and cause client connection errors?
  • Default versus optimized settings. As we all know, defaults matter, so I would expect each service to behave pretty differently when profiled “out of the box” versus highly optimized with more finely-tuned configuration files.

What I want to measure:

  • System resource utilization. A reverse proxy that underperforms another but with a fraction of the CPU and memory used can just be scaled horizontally, potentially with much greater overall capacity when you eventually reach equivalent resource use. Therefore, it’s critical to ensure that we measure process resource consumption during the test and compare them.
  • HTTP response latency. I did some preliminary tests and found that the delta between “time to first byte” and “overall response time” ends up being pretty consistent, so I’m lumping both of those metrics into “how quickly the server responds”.
    • I ended up collecting - over the duration of each test - the minimum, median, average, 90th percentile, 95th percentile, and maximum response times. However, I’m excluding minimum and maximum latency from the visualizations because they tend to blow out the axes - minimum tends to represent “something gave out somewhere”, and maximum isn’t super useful from a client’s perspective since a server can serve p95 requests with great effectiveness yet hang on a single request and skew that number wildly. p95 will serve as the “what’s the exceptionally bad case” number, then. (There’s a sketch after this list of pulling these numbers out of the k6 summary.)
  • Total number of requests processed. I’m running with a 30-second window in each test, which provides a constant duration for each server to serve as many requests as possible.
  • Error rate. I want the proxies to operate correctly! There’s not much use with blazing-fast responses if they don’t do it right. This includes hung connections, refused connections, that sort of thing. Embarrassingly high error rates will be punished with snark against their social media teams on Twitter.
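As a concrete example of where those latency statistics come from: k6 can export an end-of-test summary as JSON, and the per-request-duration numbers all live under one metric in that file. A minimal sketch of how I’d pull them out, assuming a summary exported to summary.json and the field names as I recall them from k6’s summary export (treat the exact paths as approximate):

# Dump the request-duration statistics (min, med, avg, p(90), p(95), max) from a k6 summary export
jq '.metrics.http_req_duration' summary.json

# Or grab just the p95 value for quick comparisons between runs
jq '.metrics.http_req_duration["p(95)"]' summary.json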

Finally, how I’ll measure it:

  • I chose k6 as my HTTP test driver. I’m aware of lots of different testing tools out there, the most venerable being projects like ApacheBench (ab) and wrk, but ab doesn’t do heavy concurrency and wrk is… fairly old and has some critiques. k6 is modern and fits my needs pretty well.
  • I’ll use psrecord to measure each reverse proxy’s resource use. You could measure this a million different ways, from prometheus to something super heavy-handed, but psrecord is more than sufficient. I just want to peek at one process’s (and its children’s) resource usage (see the sketch just after this list).
  • I’ll build hosts with Terraform (because That’s What We Use These Days) in EC2 and iterate with a simple bash script. I’ll visualize stuff with gnuplot because they tweeted at me (that’s a joke, I just like gnuplot).
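To make that concrete, each measurement boils down to roughly two commands per run - something like the sketch below. The script name test.js and the log file names are placeholders rather than the exact ones from my harness:

# On the system under test: sample CPU/memory for the proxy process and its children once per second
psrecord "$(pgrep -o nginx)" --include-children --interval 1 --duration 30 --log nginx-resources.txt

# On the test driver: run 300 virtual users for 30 seconds and export a JSON summary of the results
k6 run --vus 300 --duration 30s --summary-export summary.json test.js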

Hypothesis

Image via DALL·E

This is my blog post, so I get to pretend to be a real scientist here. Here are my hypotheses for these tests:

  • Nginx will, overall, perform better than Caddy. I love Caddy! Don’t get me wrong. But Nginx is specifically designed to be a fast reverse proxy and it has had years of lead time to perfect itself. How much faster? I don’t know. That’s what we’re here to find out!
  • Caddy may perform better with out-of-the-box defaults. Without thinking too hard about it, I’d bet that Nginx may not parallelize aggressively, while Caddy probably benefits from lots of goroutines spinning off everywhere to handle requests. We may even see better performance than Nginx in the non-optimized case, but I’ll predict against that.
  • I would predict that the reverse proxies will perform best responding with synthetic responses, then static HTML file content, then reverse proxied requests. I may tweak tests to see whether you can make static asset responses faster via caching.

Procedures

Lucky for you, I’ve open sourced everything that I ran in these tests so that you can see it all yourself. At the conclusion of writing all of my automation, each test looked like this (once I’d configured my AWS credentials and entered my nix shell):

VUS=300 ./bench.sh

“VUs” are k6’s term for “virtual users”, or the number of concurrent agents simulating traffic against the system under test. This script will:

  • Create two EC2 instances - their default size is c5.xlarge. I opted for the c5 series because I doubt I’ll be memory-bound, and I also want to give the proxies a chance to flex their concurrency chops (these instances have 4 cores). The instances are in a shared security group so that they can chatter easily.
  • Configure the system under test (SUT) as well as the test driver with NixOS. This sets up one with the proxies ready to test (though not started) and the other with k6 installed.
    • “Why NixOS?” could be a long answer, but here’s the short one: it lets me pin everything to a specific revision to ensure my tests are 100% reproducible, from the kernel/glibc up to the nginx and caddy builds. Also, I am a configuration management snob and Ansible gives me hives.
  • Then for nginx and caddy – and for each test type (we’re going O(n^2) on tests, baby) – run the following steps, roughly sketched after this list:
    • ssh into the SUT
    • Start the service
    • Attach psrecord to the daemon
    • ssh into the test driver
    • Execute my k6 test script
    • Copy over the test results from the test driver
    • Copy over the resource measurements from the SUT
  • Finally, compile the data. I’m collecting a table of metrics from the combined psrecord table as well as some choice data from the k6 JSON report. I’ll feed what I can to gnuplot for pretty pictures.
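Stripped of error handling and bookkeeping, the inner loop of bench.sh amounts to something like the following sketch. The hostnames, systemd unit names, and file paths here are illustrative placeholders, not the exact ones from the repository:

for proxy in nginx caddy; do
  for test in synthetic html-small html-large proxy; do
    # Start only the proxy/config combination under test (hypothetical unit names)
    ssh "$SUT" "sudo systemctl start ${proxy}-${test}"
    # Attach psrecord to the daemon for the duration of the run
    ssh "$SUT" "psrecord \$(pgrep -o ${proxy}) --include-children --duration 35 --log /tmp/resources.txt" &
    # Drive the load from the other instance
    ssh "$DRIVER" "k6 run --vus $VUS --duration 30s --summary-export /tmp/summary.json bench.js"
    wait  # let the psrecord run on the SUT wrap up before copying its log
    # Pull back the k6 summary and the resource samples, then stop the service
    scp "$DRIVER:/tmp/summary.json" "results/${proxy}-${test}-k6.json"
    scp "$SUT:/tmp/resources.txt" "results/${proxy}-${test}-resources.txt"
    ssh "$SUT" "sudo systemctl stop ${proxy}-${test}"
  done
done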

I’d also like to call out that I did some light validation that my tests were repeatable by re-running a few scenarios, and the results lined up very closely each time. From what I could tell, my automation is stable and reliable.

Configuration

How I’m configuring these daemons is pretty important, too.

In my preliminary tests I measured some differences when observing how locations (in nginx.conf) and matchers (in my Caddyfile) behave. This is expected, as encountering a conditional or predicate of some type for every request incurs overhead, so I’m working with single-purpose configuration files.

(To be clear, I’m talking about stanzas like location / { ... } in an nginx.conf file. Does each reverse proxy have a different performance profile for how it matches incoming requests? Almost certainly. And I should probably test that! But I’m already in over my head, and this exercise is purely about “how fast can they possibly go without breaking”, so adding request matching overhead as a variable is one variable too many at the moment.)

Defaults

Anyway, here’s how I configured each service with out-of-the box defaults. The synthetic response configuration files are short – first, the Caddyfile:

:8080 {
  respond "Hello, world!"
}

…then the nginx.conf:

events {
    use epoll;
}

http {
    access_log off;

    server {
        listen 0.0.0.0:8080;

        return 200 "Hello, world!";
    }
}

Similarly, the settings for static file serving with minimal defaults are straightforward – first, the Caddyfile:

:8080 {
  root * /srv/static
  file_server
}

…then the nginx.conf:

events {
    use epoll;
}

http {
    access_log off;

    server {
        listen 0.0.0.0:8080;

        location / {
            alias /srv/static/;
        }
    }
}

And finally, a generic reverse proxy stanza – first, the Caddyfile:

:8080 {
  reverse_proxy localhost:8081
}

…then the nginx.conf:

events {
    use epoll;
}

http {
    access_log off;

    server {
        listen 0.0.0.0:8080;

        location / {
            proxy_pass http://127.0.0.1:8081;
        }
    }
}

Take note that, for both Caddy and Nginx, I am not enabling access logs. That’s a potential bottleneck and we aren’t trying to measure performance+logging, just trying to zero in on how fast each service serves requests.

Optimized

Compared to their defaults, what does an optimized configuration for both Caddy and Nginx look like?

For Caddy, I asked the experts! Both Francis and Matt (Matt is the original author and BDFL of Caddy) offered some advice, although most of it culminated in “there’s probably not a lot to tweak”. That’s fair enough! Many of the settings they offered as knobs and dials play more into long-running or persistent connections, which I hadn’t set up any tests for. (Matt also suggested large response body tests, which I didn’t originally include, but it was a good idea, so I had to re-run all my tests. Alas!)

Additionally, Matt had two requests of me:

  1. To test this recently released beta of Caddy that incorporates some changes that may impact performance one way or the other.
  2. To test this pull request that implements sendfile, which should be a definite improvement (I set sendfile on; in the optimized Nginx configuration, for example).

Sure! To be fair to both Nginx and Caddy, I’ll include the tests with these added improvements as an appendix, because my goal with my standard barrage of tests is to compare vanilla, generally-available releases. Once these Caddy patches land in a release, their benefits (if the results identify them as such) will be directly applicable.

By contrast, here is a delta for my tuned Nginx configuration. Nginx is hungry – hungry for hot dogs (and more processor cores):

     events {
         use epoll;
+        worker_connections 1024;
     }
+
+    worker_processes auto;

     http {
         access_log off;
+        sendfile on;
+        proxy_cache_path /tmp/nginx-cache keys_zone=cache:10m;
+

         server {
             listen 0.0.0.0:8080;
+            proxy_cache cache;

worker_processes are going to be the magic words for Nginx here to get at those cores. auto is a convenient choice for us, so we’ll use that. In this Nginx blog post there are more tips for us, including advice to bump up worker_connections and to use sendfile() in our http/server/location context. I’ll reference the advice linked to in that blog post to get caching functioning as effectively as possible, too, using those proxy_* settings.

I didn’t change a ton here but it’s the set of changes that appear to safely make the biggest improvements. Many of the other suggestions in that aforementioned Nginx blog post are regarding operating system-level tweaks, which I’ve applied from the beginning (you can investigate the source code for my benchmarks for the specific ulimit and sysctl changes in the base.nix file).

Results

Are you ready for numbers?????? Did you skip here from the intro? Have you no shame?

Here’s how to read the following charts and graphs:

  • The big one in the middle is total HTTP response time. Lower bars are better because it means the server was responding faster.
  • The upper right chart is errors. If they start to go up, the reverse proxy is starting to fail - refused connections, that kind of thing. Any bars are bad (it means errors happened).
  • The lower right chart is for overall number of requests. This is a function of a few mechanisms which includes how quickly the test driver can fire off HTTP requests. Higher bars are better.
  • The second set of graphs measure CPU and memory use with CPU utilization on the left Y axis and memory on the right Y axis. Lower numbers are better, generally.

My graphs may look a little cramped, but I erred on the side of “ease of comparison” at the cost of density - otherwise it can be challenging to try and compare across different variables. I’ve set the enhanced flag on the gnuplot files, so you can click on the name of a plotted series in order to toggle it on and off. Clicking anywhere on the plot SVG will attach coordinates to your mouse but I have no idea what in the hell the numbers are (they don’t look like they’re tied to any axis).

10 Clients

Let’s start at the beginning: 10 concurrent clients. Remember, results are in milliseconds (ms). Although 10 concurrent clients is small, I’m also not throttling clients, and once a request completes, it comes right back, so there’s no time for the proxies to rest between requests. This means that although the volume is low, the rate of activity is still pretty busy.

test min median average p90 p95 max requests errors
nginx-default-synthetic 0.10 0.18 0.22 0.31 0.40 19.15 1062277 0
caddy-default-synthetic 0.12 0.22 0.26 0.35 0.48 24.90 937547 0
nginx-optimized-synthetic 0.10 0.18 0.22 0.31 0.39 4474.83 1051255 0
nginx-default-html-large 0.41 2.18 2.27 3.88 4.20 14.76 126996 0
caddy-default-html-large 0.46 1.93 2.27 3.71 3.96 11.98 126978 0
nginx-optimized-html-large 0.39 2.26 2.27 3.67 3.96 22.63 127025 0
nginx-default-html-small 0.12 0.22 0.25 0.34 0.43 20.13 949551 0
caddy-default-html-small 0.15 0.27 0.32 0.43 0.57 18.88 795378 0
nginx-optimized-html-small 0.13 0.25 0.28 0.39 0.47 20.60 850631 0
nginx-default-proxy 0.20 0.69 0.70 0.86 0.92 14.42 395210 0
caddy-default-proxy 0.22 0.45 0.53 0.78 1.04 14.50 506388 0
nginx-optimized-proxy 0.21 0.45 0.51 0.67 0.88 49.34 528288 0

Alright, we’re dealing with a relatively small number of concurrent users, and the results aren’t terribly surprising. That said, we’re gathering useful baselines to know which requests are more taxing than others.

  • The median and mode are so close for many of these tests that the little deltas almost aren’t noticeable. We hover around ~0.2ms for most requests, although the very large HTML responses take longer, which we’d expect. They’re pretty consistent at about ~2ms as well.
  • We have one case of max being an outlier, which reinforces the fact that we should primarily be looking at p95 rather than max for worst-case scenarios. max is still useful to know, but not strictly representative. I’ll still include it in the tables, but I’m not intensely interested in it.
  • No errors (I should hope so) and the differences between synthetic, file-based, and reverse proxy overall request count are about what I would expect. Synthetic should be easy, replying with HTML files is a little more, reverse proxying takes a few more bells and whistles, and large HTML responses are the most burdensome.

What else? The synthetic responses - which have the least complexity - push the highest request count, as you might expect. Nginx beats out Caddy for this test whether optimized or not in terms of overall requests. This one is interesting: look at the minimum and median response times for synthetic responses for either Caddy or Nginx (default or optimized, it doesn’t matter). There’s a very consistent 0.02ms penalty on Caddy’s response time. I’m not sure what it means! But it’s an interesting finding that I would wager is a golang runtime tax.

The charts tell a pretty clear story: large HTML responses take a lot of time. Despite those bars skewing the Y-axis, the values are still really close (within ~2ms of each other for most stats). All other test types (synthetic, small HTML, reverse proxied) are super, super close.

Now let’s look at resource consumption:

First and foremost, remember that c5.xlarge machines have 4 cores per host. That’s important context: the first thing that pops out to me is that Caddy seems to use whatever it can (Caddy’s highest line hovers around 300%, or 3 cores), but it’s apparent that – by default – nginx is bound to one core (it caps out at 100% CPU, or one entire core), which is consistent with the nginx documentation that states worker_processes defaults to 1. The optimized tests have much more breathing room and use whatever they can. Nginx uses almost no memory (relatively speaking) whereas Caddy, presumably at startup, grabs a small handful (maybe 40MB) and doesn’t go far beyond that. Presumably that’s the golang runtime more than anything else.

I also found that the “difficulty ranking” for each test sorted by CPU utilization was interesting because it’s re-ordered based on whether we look at the Nginx or Caddy graphs. Nginx, in “most busy” order, goes proxy → HTML small → HTML large → synthetic. Caddy is ordered HTML small → proxy → synthetic → HTML large. My hunch is that Caddy may be really grinding on small HTML tests because it’s copying a lot of buffers around.

Remember, this is 10 concurrent connections, so there’s nothing pressing hard on our reverse proxies. YET. Let’s bump it up to 200 concurrent clients.

200 Clients
test min median average p90 p95 max requests errors
nginx-default-synthetic 0.11 4.15 5.67 12.08 14.95 310.66 1007186 0
caddy-default-synthetic 0.13 4.14 5.64 11.77 14.76 67.48 1025549 0
nginx-optimized-synthetic 0.11 4.14 5.66 12.05 14.95 69.04 1016499 0
nginx-default-html-large 0.67 33.62 47.03 90.63 213.61 3210.70 127116 0
caddy-default-html-large 0.83 34.48 47.12 90.23 214.51 1673.57 127094 0
nginx-optimized-html-large 0.66 32.02 47.00 92.07 212.89 3207.05 127212 0
nginx-default-html-small 0.14 4.66 5.93 11.58 15.23 67.54 981828 0
caddy-default-html-small 0.17 4.68 6.40 13.26 17.08 210.98 920272 0
nginx-optimized-html-small 0.15 4.98 6.72 14.44 17.64 74.88 858443 0
nginx-default-proxy 0.24 14.69 14.68 17.65 18.15 186.88 406757 0
caddy-default-proxy 0.27 10.97 13.28 25.70 32.29 103.55 449154 0
nginx-optimized-proxy 0.25 9.81 10.25 11.47 16.12 63.89 580416 0

A little more interesting. Caddy cranks out more synthetic responses than even optimized Nginx, but large HTML tests are shockingly close for all cases. Nginx does extremely well in the optimized proxy tests, with half the p95 response time of Caddy and noticeably better throughput. Caddy does surprisingly well at those difficult large HTML responses: it maintains throughput parity with both Nginx configurations while clocking in a max turnaround way lower than Nginx. Going along with the resource graphs for 10 concurrent clients, Caddy seems to struggle with lots of little file buffers, as its max is really up there. Nginx, meanwhile, does super well at that case with about ~60,000 more requests in the 30-second window (though that’s the default configuration - I’m unsure why the optimized config didn’t reach that level).

This reinforces the prior guess about CPU constraints. For the more difficult tasks, Caddy is claiming all available CPU, while nginx is stuck with one in the default configuration but bursts to more in the optimized case. Given that the HTML test hits 400% but the rest don’t reach the ceiling, I would guess that the CPU might be in iowait for the HTML files because Caddy doesn’t have an inbuilt caching mechanism. That’s something we can investigate later. Caddy uses a little more memory, but it’s not near concerning levels yet.

The graph is crowded, but if you click on the line name to filter some of them out, you can see a garbage collection stairstep forming in the memory graph for Caddy’s proxy tests. This may be more pronounced as we go along.

Oh boy. Here comes 500 concurrent users:

500 Clients
test min median average p90 p95 max requests errors
nginx-default-synthetic 0.00 17.05 20.64 34.34 51.28 173.30 651900 1.67
caddy-default-synthetic 0.22 13.57 16.40 26.06 42.36 140.70 811764 0.00
nginx-optimized-synthetic 0.20 13.70 16.56 26.21 43.19 137.09 806307 0.00
nginx-default-html-large 1.02 50.63 117.64 279.68 464.35 26494.41 126886 0.00
caddy-default-html-large 1.24 51.16 118.28 276.32 436.14 26849.83 127016 0.00
nginx-optimized-html-large 0.99 51.24 118.19 277.46 432.25 26929.39 126922 0.00
nginx-default-html-small 0.00 17.37 21.16 35.42 52.57 168.44 637746 1.50
caddy-default-html-small 0.27 14.20 17.21 27.44 44.67 225.08 803503 0.00
nginx-optimized-html-small 0.23 16.48 19.26 30.32 46.08 220.35 706722 0.00
nginx-default-proxy 0.00 19.30 36.36 31.85 57.60 777.42 405987 0.64
caddy-default-proxy 0.40 39.62 42.51 68.43 81.30 284.87 351182 0.00
nginx-optimized-proxy 0.67 24.72 27.74 29.83 53.18 1044.81 536393 0.00

Wew lad! Now we’re cooking with gas. Most noteworthy:

  • Singly-cored nginx has started to throw errors. About 1.6% for synthetic tests, sub-1% for proxied tests, and around ~1.5% for small HTML tests. However, with more workers, Nginx can handle the load.
  • It looks like Caddy does really well when it’s solely working within its own process. The Caddy synthetic response request count is really up there.
  • The small HTML tests are a little surprising. Caddy outperforms Nginx even when Nginx is optimized. Especially given that optimized Nginx is caching and our GA release of Caddy doesn’t have sendfile, I’d expect them to be a little closer, but Caddy leads.
  • Reverse proxy latency is still a notable pain point in the Caddy tests. The resource graphs may help explain some of this:

Remember that default Nginx is starting to error out, so we’re observing the limits of what it can do with one c5.xlarge core.

I would venture to guess that Caddy spins up some sort of golang structure for each reverse proxy connection, because we’re seeing the tell-tale signs of garbage collection in Caddy’s memory graph during the reverse proxy tests. In our blow-out test, that memory graph may be even more jagged. At 500 concurrent clients, whatever is sitting in garbage-collectible memory is starting to look non-trivial in size, too - the graph sits around the 96MB mark.

1,000 Clients

OH LAWD, HE COMIN’

test min median average p90 p95 max requests errors
nginx-default-synthetic 0.00 0.76 6.62 17.27 31.72 292.06 344427 97.38
caddy-default-synthetic 0.28 28.28 34.30 43.66 92.14 225.71 825893 0.00
nginx-optimized-synthetic 0.24 28.79 34.56 43.61 91.59 249.39 812550 0.00
nginx-default-html-large 1.14 57.53 224.04 392.48 752.06 30031.42 127304 0.00
caddy-default-html-large 1.47 64.93 244.12 432.91 790.06 53751.69 127170 0.00
nginx-optimized-html-large 1.29 62.18 250.73 371.24 738.56 53439.86 127237 0.00
nginx-default-html-small 0.00 0.64 6.54 16.37 31.14 309.75 340595 98.17
caddy-default-html-small 0.31 30.92 38.07 64.27 106.41 333.18 770682 0.00
nginx-optimized-html-small 0.29 32.73 38.68 49.53 97.66 218.20 729572 0.00
nginx-default-proxy 0.00 4.09 22.94 24.15 42.49 4540.04 351324 72.59
caddy-default-proxy 0.44 62.64 95.01 114.17 157.29 1604.57 314757 0.00
nginx-optimized-proxy 0.38 53.87 57.52 66.32 110.75 262.62 516539 0.00

At 1,000 concurrent clients, the singly-cored nginx starts to fall down. Error rates are very high and the overall request count suffers accordingly; fully leveraging all 4 cores is what saves Caddy and the optimized Nginx. In the synthetic and large HTML tests, Caddy and optimized Nginx are really pretty close.

The most noteworthy numbers here are the proxy tests - optimized Nginx does much better than Caddy (200,000(!) more requests and much better latency turnaround). Bravo! Synthetic tests are sort of interesting because – remember, these requests never hit a static HTML file or a reverse proxied target – the latency numbers are very close, yet Caddy sent back 10,000 more responses.

That last graph shows some pretty clear garbage collection happening in the golang runtime. Default-config Nginx CPU use falters, presumably due to lots of network traffic chaos and connections being missed. At 1,000 clients, it’s interesting to note the lack of staircase GC patterns in Caddy’s HTML and synthetic tests - the big GC sweeps are caused by Caddy’s reverse proxy plumbing. I’m now pretty confident that the reverse proxy code’s GC time is what ails Caddy’s reverse proxy performance.

The optimized Nginx resource graphs top out at 200%, which is a little confusing. My assumption was that auto would use all available cores, but that may not be the case. In any event, the graphs are clear: Nginx is really efficient in either configuration style. The memory is kept low, and CPU doesn’t blow out completely - but it makes me wonder if more gains are to be had in the optimized Nginx config by altering worker_processes.

Note: I’ve now tested this with a manual setting of 4 for worker_processes and it seems to behave the same. This probably warrants further investigation and potentially turning up the value beyond the physical core count.
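If you want to verify what Nginx actually ended up with - rather than inferring it from CPU graphs - a couple of quick commands on the SUT will show the resolved worker settings and where the workers landed. This is roughly how I’d sanity-check it:

# Dump the fully-resolved configuration Nginx loaded and check the worker settings
sudo nginx -T 2>/dev/null | grep -E 'worker_(processes|connections)'

# List the running nginx processes and the CPU each was last scheduled on
ps -C nginx -o pid,psr,comm

# Compare against the number of cores the instance exposes
nproc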

10,000 Clients

You know what? I can’t help it. I wanted to see Nginx and Caddy fail to know what that behavior looks like, so here we are:

test min median average p90 p95 max requests errors
nginx-default-synthetic 0.00 1.17 50.17 153.57 304.43 1293.23 332152 99.18
caddy-default-synthetic 0.19 306.45 386.70 676.05 1084.56 2052.79 727305 0.00
nginx-optimized-synthetic 0.00 2.38 54.26 130.25 387.82 2311.45 324380 95.98
nginx-default-html-large 0.00 1.45 67.31 254.32 463.81 3715.41 324881 99.80
caddy-default-html-large 1.71 218.42 2874.14 6649.94 17990.31 58944.72 124604 0.00
nginx-optimized-html-large 0.00 1.53 100.99 142.84 749.10 3496.60 307948 95.71
nginx-default-html-small 0.00 1.16 48.19 127.09 281.31 1629.37 329233 99.31
caddy-default-html-small 0.22 343.55 438.22 838.89 1149.83 1932.29 668881 0.00
nginx-optimized-html-small 0.00 2.52 50.68 130.00 357.75 2664.59 325973 96.36
nginx-default-proxy 0.00 1.30 61.41 209.51 433.00 2507.88 327086 99.89
caddy-default-proxy 13.35 1228.58 1286.12 1406.96 1498.23 21282.06 230619 0.00
nginx-optimized-proxy 0.00 2.06 47.57 123.78 252.50 2539.05 326510 98.55

Now this is interesting. First of all – our axes are blown out. It’s more worthwhile to look at the error and total request count charts and the table for everything else.

And… we have extremely different failure behavior! Nginx, once at capacity, will start refusing connections. Error rates are high and requests are down. By contrast, Caddy maintains an error rate of zero but sacrifices its latency turnaround times at the altar of “no errors”. You can see this clearly by comparing latency versus request count in the visual aid: Caddy starts to really lag but doesn’t give up on any requests, so responses are really slow but we get a hell of a lot more of them through. However, that’s only true for the synthetic tests and small HTML tests. For proxied requests and large HTML responses, Nginx responds with more, albeit with errors.


Nginx actually does a good job of keeping normal requests moving quickly when it drops those it can’t handle. Median and mode are decent even when it refuses many incoming connections – the ones that make it through are kept moving along speedily.

In trying to ensure that clients are served at any cost, Caddy starts eating up resources. Its memory starts to climb, in contrast to Nginx, which keeps memory low and maintains better turnaround times for the clients it does accept connections from. It seems like Caddy may want to spend more time ensuring that network writes are flushing out buffers to avoid bloating memory.

In particular, that large HTML response Caddy memory graph is troublesome. My Y-axis labels are starting to get drunk, but Caddy is really pushing it and memory reaches 1GB at one point. Oops! The jagged Caddy CPU graphs make me think that there may be a pause somewhere, but I’m not sure (I want to say that a mark and sweep might’ve happened, but that graph ranges for 30 seconds and I have a hard time imagining a GC running for multiple seconds, but maybe that is indeed what occurred).

Summary

Image via DALL·E

What did we learn?

  • The most striking piece of new knowledge for me was learning about failure modes. Nginx will fail by refusing or dropping connections, Caddy will fail by slowing everything down. Is one better than the other? For certain use-cases, almost certainly. Some folks will want fail-fast, while others will want to keep accepting connections at all costs. The key point is that there is a difference (frankly I think my preference is for fail-fast).
  • As predicted, “synthetic” responses seem to be easiest, followed by small HTML file content, followed by reverse proxied requests, followed by large HTML content. Nginx’s caching behaviors let it really shine with static asset files (and, presumably, its sendfile capability).
  • Caddy pays the cost for garbage collection, but it’s not an oppressive cost (at least at “normal” traffic levels). Nginx uses an almost effortless amount of memory. Caddy peaked at almost 160MB real memory allocated for non-breaking tests, which may or may not be significant depending on what amount of total memory is available to the OS. It remains to be seen what specific code path is causing all this malloc/free, but my tests seem to point at whatever mechanism underlies reverse_proxy.
  • Caddy’s default configuration is good. We get all our cores unlocked and we rev up every resource available to us without any memory leaks. Recall that my “default” and “optimized” Caddy configurations are identical. Nginx is scoped to a single core by default (you can learn this from the documentation, but we’ve now observed it against Caddy directly).

Before you hit really oppressive levels of traffic, Caddy and optimized Nginx are going to serve you well. My above bullet points are all important to consider at the edge cases, but based upon what I’ve observed, there are likely few times that the differences will come into play. Maybe when you get absolutely flooded with traffic. Maybe when this post gets flooded with traffic. Come at me bro, I’m on S3. Just try and take down Amazon.

Is there an answer to “is Caddy or Nginx better”? I don’t think so, but armed with this knowledge, maybe you can make more informed decisions. There are other factors to consider when selecting a reverse proxy aside from performance alone. Do you want the bells and whistles that Caddy includes, like first-party support for Let’s Encrypt and a native API? Do you want to carry your Nginx knowledge directly over to the well-supported k8s Nginx ingress controller? Does a C vs. golang runtime matter to you, from a performance, security, or profiling perspective?

I hope that this was a helpful exercise and provides useful data for future operations engineers. I really went overboard after initially asking, “are there hard numbers for this?” and ended up here. I’ll probably keep using Caddy where I am today – I make active, regular use of its Let’s Encrypt features, miscellaneous plugins, native API, and so on – but might turn to Nginx more often if I know I’ll be dealing with ungodly levels of inbound traffic.

I have no idea where this post will be shared, but you can either comment near the bottom in my embedded Discourse comments or I can find it on your tech news aggregator of choice. I’ll be there to answer questions and leave overly verbose comments.


Appendix A: Code

So, you want to run all these tests? Merry Christmas, there’s nix and terraform under the tree.

That repository should have all the requisite documentation necessary to reproduce these results. You’ll need an AWS account and a local nix installation, but the rest is fairly automated. Hopefully I’ve done a good enough job that your results will be very close to mine. I’m not sure whether it matters, but my tests were performed in us-west-2. The instance sizes, NixOS version, and other specifics are defined in the repository.

My bench.sh test driver isn’t great – it’s sort-of-brittle bash but you’re welcome to carve it up. It doesn’t handle the Caddy sendfile tests out of the box. But hopefully it’s a good starting point.

Appendix B: Caddy sendfile

💡 I’ve revised these findings after Matt found that I had actually failed to properly include the sendfile patch. Whoops! If you’re coming to this section again, please note that the current findings are accurate and any cached copy you may have seen could be outdated. I’m making ablutions by including some “no metrics” changes as well to measure the performance.

Okay, Matt. You ask and I deliver.

Per this conversation, Caddy has just merged sendfile support. This is good! sendfile avoids spurious memory operations and should speed things up. So let’s try it!

My original sendfile tests used some nix overrides to build a specific revision of caddy at the upstream version that has the changes present, but this nix issue is super annoying and means I can’t do it easily. So I’ve simply grabbed the relevant .patch files and applied them to specific Caddy builds in my benchmarking test harness (you can see the patches in the benchmarking source).
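If you’d rather reproduce a patched build by hand instead of through my nix harness, the equivalent looks roughly like this - the patch filename is a placeholder for whichever upstream patch you’ve downloaded, and xcaddy would work just as well:

# Check out the release these benchmarks ran against and apply the sendfile patch on top
git clone --branch v2.5.2 https://github.com/caddyserver/caddy.git
cd caddy
git apply ../caddy-sendfile.patch   # hypothetical local copy of the upstream patch

# Build the patched binary and sanity-check it
go build -o caddy ./cmd/caddy
./caddy version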

I then ran the same barrage of tests at a concurrency level of 200, because that number seemed to push things far without spilling over into bad p95 and max values (I originally ran these at 500, but that seemed to introduce some jitter in these specific tests, so I re-ran them at 200). With those results, I can plot them against the other mainline Caddy 2.5.2 results. I’d prefer to compare apples to apples here instead of running another full suite of sendfile-enabled Caddy against all variants of Nginx.

Note that, at the time of this re-testing, Matt also asked for some benchmarks against Caddy without metrics enabled, which can squeeze out additional performance. I probably owe Matt this one since I’ve been propagating misinformation with my erroneous sendfile results, so I threw that configuration into the mix as well.

Note that I struck the “large” HTML tests from the graph as they blow out the Y-axis:

test min median average p90 p95 max requests errors
caddy-default-synthetic-baseline 0.14 4.53 6.16 13.00 16.25 215.46 937165 0.00
caddy-default-synthetic-no-metrics 0.14 4.50 6.16 13.02 16.24 73.30 941124 0.00
caddy-default-synthetic-sendfile 0.14 4.49 6.12 12.89 16.21 214.50 941567 0.00
caddy-default-html-large-baseline 0.82 34.57 47.23 88.31 214.74 3095.93 126833 0.00
caddy-default-html-large-no-metrics 0.80 33.72 46.95 90.90 213.69 6402.51 126885 0.00
caddy-default-html-large-sendfile 0.77 33.43 47.18 91.56 214.23 3245.40 126864 0.00
caddy-default-html-small-baseline 0.19 5.14 7.03 14.74 18.94 82.12 838601 0.00
caddy-default-html-small-no-metrics 0.18 4.82 6.43 12.78 17.03 71.13 910872 0.00
caddy-default-html-small-sendfile 0.19 5.16 7.01 14.59 18.80 81.29 841168 0.00
caddy-default-proxy-baseline 0.27 12.06 14.57 28.26 35.14 127.49 409339 0.00
caddy-default-proxy-no-metrics 0.29 11.72 14.18 27.42 34.15 227.16 420764 0.00
caddy-default-proxy-sendfile 0.26 12.11 14.62 28.33 35.27 131.89 407818 0.00

What do we end up with this time?

  • At first blush, it appears that disabling metrics really does offer a significant performance boost! The small HTML tests in particular really rip through the request count at 910872 total, with better latency metrics, too.
  • The sendfile patch does seem to improve synthetic tests pretty reliably and does so modestly for small HTML tests.

You might think – as I did – that some of these patches weren’t correctly applied, since some of those sendfile results are really close to baseline, but I validated my systems under test pretty heavily (to the extent that I was doing some strace validation during pre-flight checks), so I’m pretty confident that each of the three configurations is doing what it says it’s doing. More than anything else, this suggests to me that some more science may be useful here to understand why exactly the performance profiles look the way they do. Matt has suggested that changing proxy buffers may have an impact, so I’ll probably add more findings in another appendix exploring further optimizations to the Caddy reverse proxy configuration, to try to determine how those settings interact with performance.
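For reference, the strace pre-flight check I mention above might look something like the sketch below: attach strace to the running Caddy, request the large static asset, and see whether sendfile(2) actually shows up. The port and file path match the test setup described earlier; the trace filename is a placeholder:

# Trace only sendfile syscalls from the running Caddy process for a ten-second window
sudo timeout 10 strace -f -e trace=sendfile -o /tmp/sendfile.trace -p "$(pgrep -o caddy)" &

# Request the large static asset while the trace is running, then wait for it to finish
curl -s http://localhost:8080/jquery-3.6.1.js > /dev/null
wait

# A non-zero count means the binary really is handing the file off via sendfile(2)
grep -c sendfile /tmp/sendfile.trace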

