TL;DR Load testing and scaling architecture analysis for a web application using Apache Bench, exploring performance bottlenecks and scaling strategies.
Today I wanted to have a closer look at how my web application Formshive performs in different scenarios. It’s probably been 6 months since I last ran a benchmark at work, so I had a quick look around, but ended up with Apache Bench - not only because I’m familiar with it, but because it does exactly what I need.
Tool | Description | Pros | Cons |
---|---|---|---|
Apache Bench | HTTP benchmarking tool from Apache | Easy to use | Single-threaded |
wrk | Modern HTTP benchmarking tool | Multi-threaded | No time-based metrics |
Artillery | Node.js load testing toolkit | HTTP/WebSocket support | Node.js overhead |
k6 | Developer-centric load testing tool | JavaScript scripting | CLI-only in free version |
Benchmark with Apache Bench
Let’s start with Apache Bench.
Setup
guix shell httpd
mkdir /tmp/ab-test && cd /tmp/ab-test
echo "email=your-email@gmail.com&name=Mike&message=Hi, I want to enquire about ...." > form-data.txt
The tests will run against a local Dockerized server on my machine, so there’s little overhead (no internet, HAProxy, or SSL in between). A few notes on the setup:
- The system (AMD Ryzen 5 7640U, 64GB memory) is about 90% idle
- The DB pool has a connection limit of 100, and PostgreSQL runs from the default Docker image
Default Form
I have created a form without spam checks, field validation, or integrations (email, webhooks, etc.) to keep it simple. All that happens here (sketched in the code below) is:
- check whether the form exists
- check that the account has a subscription and/or enough balance to pay for the digest
- look up the IP address in the MaxMind database, and parse the browser user agent
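Stripped of the real dependencies, the hot path for this form looks roughly like the sketch below. This is a hypothetical outline - the function names and types are mine, not Formshive’s actual code:

struct Form { owner_has_balance: bool }

// Stand-ins for the real lookups; in Formshive these hit PostgreSQL, the MaxMind
// database, and a user agent parser.
fn find_form(_id: &str) -> Option<Form> { Some(Form { owner_has_balance: true }) }
fn lookup_country(_ip: &str) -> Option<&'static str> { Some("DE") }
fn parse_user_agent(ua: &str) -> &str { ua.split('/').next().unwrap_or("unknown") }

fn handle_digest(form_id: &str, ip: &str, ua: &str) -> Result<(), &'static str> {
    let form = find_form(form_id).ok_or("form not found")?;   // 1. does the form exist?
    if !form.owner_has_balance {                               // 2. subscription / balance check
        return Err("no subscription or balance");
    }
    let _country = lookup_country(ip);                         // 3. MaxMind GeoIP lookup
    let _client = parse_user_agent(ua);                        // 4. user agent parsing
    Ok(())                                                     // store the submission
}

fn main() {
    println!("{:?}", handle_digest("b35f0ee6", "127.0.0.1", "Mozilla/5.0"));
}

With that out of the way, here’s the first run: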
$ ab -n 1000 -c 10 -p form-data.txt -T application/x-www-form-urlencoded http://localhost:8003/v1/digest/b35f0ee6-23fe-4a0d-bf65-312315493775?redirect=none
This is ApacheBench, Version 2.3 <$Revision: 1923142 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 8003
Document Path: /v1/digest/b35f0ee6-23fe-4a0d-bf65-312315493775?redirect=none
Document Length: 0 bytes
Concurrency Level: 10
Time taken for tests: 4.745 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 197000 bytes
Total body sent: 290000
HTML transferred: 0 bytes
Requests per second: 210.76 [#/sec] (mean)
Time per request: 47.448 [ms] (mean)
Time per request: 4.745 [ms] (mean, across all concurrent requests)
Transfer rate: 40.55 [Kbytes/sec] received
59.69 kb/s sent
100.23 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 4 47 25.7 44 149
Waiting: 4 47 25.7 44 148
Total: 4 47 25.7 45 149
Percentage of the requests served within a certain time (ms)
50% 45
66% 56
75% 62
80% 66
90% 81
95% 94
98% 111
99% 122
100% 149 (longest request)
A mean request time of 47ms with 10 concurrent requests is not too bad.
Let’s increase the concurrency to 100 and see how it performs under higher load:
$ ab -n 1000 -c 100 -p form-data.txt -T application/x-www-form-urlencoded http://localhost:8003/v1/digest/b35f0ee6-23fe-4a0d-bf65-312315493775?redirect=none
Document Path: /v1/digest/b35f0ee6-23fe-4a0d-bf65-312315493775?redirect=none
Document Length: 0 bytes
Concurrency Level: 100
Time taken for tests: 4.381 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 197000 bytes
Total body sent: 290000
HTML transferred: 0 bytes
Requests per second: 228.24 [#/sec] (mean)
Time per request: 438.141 [ms] (mean)
Time per request: 4.381 [ms] (mean, across all concurrent requests)
Transfer rate: 43.91 [Kbytes/sec] received
64.64 kb/s sent
108.55 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 1.0 0 5
Processing: 10 402 118.9 430 626
Waiting: 5 402 118.9 430 626
Total: 10 403 118.1 430 626
Percentage of the requests served within a certain time (ms)
50% 430
66% 447
75% 460
80% 470
90% 493
95% 513
98% 543
99% 574
100% 626 (longest request)
Our mean request time is up by 10x, now at 438ms with 100 concurrent requests. This is expected, as the server has to handle more simultaneous connections, which increases the processing time.
There’s a lot of room for optimization. In a production environment, a load balancer would distribute the load across multiple instances, which would help keep request times at the lower end of the range (~5-50ms).
Let’s test more sustained load with 10,000 requests and 100 concurrent connections:
$ ab -n 10000 -c 100 -p form-data.txt -T application/x-www-form-urlencoded http://localhost:8003/v1/digest/b35f0ee6-23fe-4a0d-bf65-312315493775?redirect=none
Document Path: /v1/digest/b35f0ee6-23fe-4a0d-bf65-312315493775?redirect=none
Document Length: 0 bytes
Concurrency Level: 100
Time taken for tests: 48.740 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 1970000 bytes
Total body sent: 2900000
HTML transferred: 0 bytes
Requests per second: 205.17 [#/sec] (mean)
Time per request: 487.402 [ms] (mean)
Time per request: 4.874 [ms] (mean, across all concurrent requests)
Transfer rate: 39.47 [Kbytes/sec] received
58.10 kb/s sent
97.58 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.4 0 4
Processing: 10 484 59.4 480 743
Waiting: 6 484 59.4 480 743
Total: 10 484 59.2 480 743
Percentage of the requests served within a certain time (ms)
50% 480
66% 499
75% 512
80% 522
90% 546
95% 570
98% 598
99% 619
100% 743 (longest request)
The mean request time has now increased to 487ms with 10,000 requests at 100 concurrent connections, at 100% CPU utilization. Not too bad for this little Ryzen 5, with no optimization whatsoever.
At ~205 requests per second, that works out to ~17,726,688 requests per day (205.17 × 86,400).
SPAM Checks
I’ve set up a new form; everything remains the same, except that we now check the message for spam, and look up whether the user has reported the email as spam. This is quite heavy because of the Bayesian spam filter. The filter is CPU-bound, so for the first test I will use a concurrency of 1:
$ ab -n 100 -c 1 -p form-data.txt -T application/x-www-form-urlencoded http://localhost:8003/v1/digest/5fce6d21-a9a5-4e9e-a598-896818765a80?redirect=none
Document Path: /v1/digest/5fce6d21-a9a5-4e9e-a598-896818765a80?redirect=none
Document Length: 0 bytes
Concurrency Level: 1
Time taken for tests: 49.728 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 19700 bytes
Total body sent: 29000
HTML transferred: 0 bytes
Requests per second: 2.01 [#/sec] (mean)
Time per request: 497.281 [ms] (mean)
Time per request: 497.281 [ms] (mean, across all concurrent requests)
Transfer rate: 0.39 [Kbytes/sec] received
0.57 kb/s sent
0.96 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 5 497 901.5 8 2585
Waiting: 5 497 901.5 8 2585
Total: 5 497 901.5 8 2586
Percentage of the requests served within a certain time (ms)
50% 8
66% 16
75% 177
80% 2053
90% 2181
95% 2222
98% 2357
99% 2586
100% 2586 (longest request)
You can already see that the request time increases quickly: even though the spam check runs in a separate thread, the overall CPU load is at 100%.
Let’s see what happens when we increase the concurrency to 10:
$ ab -n 1000 -c 10 -p form-data.txt -T application/x-www-form-urlencoded http://localhost:8003/v1/digest/5fce6d21-a9a5-4e9e-a598-896818765a80?redirect=none
Document Path: /v1/digest/5fce6d21-a9a5-4e9e-a598-896818765a80?redirect=none
Document Length: 0 bytes
Concurrency Level: 10
Time taken for tests: 325.132 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 197000 bytes
Total body sent: 290000
HTML transferred: 0 bytes
Requests per second: 3.08 [#/sec] (mean)
Time per request: 3251.321 [ms] (mean)
Time per request: 325.132 [ms] (mean, across all concurrent requests)
Transfer rate: 0.59 [Kbytes/sec] received
0.87 kb/s sent
1.46 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 5
Processing: 6 3198 2794.1 2615 15867
Waiting: 5 3198 2794.1 2615 15866
Total: 6 3198 2794.1 2616 15867
Percentage of the requests served within a certain time (ms)
50% 2616
66% 4004
75% 4887
80% 5227
90% 7271
95% 8430
98% 10139
99% 11445
100% 15867 (longest request)
The mean request time has skyrocketed to 3251ms with 10 concurrent requests, which is expected because the spam check uses a significant amount of CPU cycles and runs concurrently with the request processing.
It would be really easy to improve this by queuing the spam checks to be processed later, or by offloading them to another server, but most Formshive users use captchas, because they are much more reliable at keeping spam out - and free.
A spam check costs 0.005 Euro at the moment, so this test (1,000 requests) would have cost 5 Euro.
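For completeness, here is a minimal sketch of the deferred approach, assuming the redis crate; the queue name and payload format are made up for illustration, not Formshive’s actual schema:

use redis::Commands;

// The submission handler pushes the message onto a Redis list and returns
// immediately, instead of classifying it inline.
fn enqueue_spam_check(con: &mut redis::Connection, submission_id: &str, message: &str) -> redis::RedisResult<()> {
    let job = format!("{submission_id}\t{message}");
    let _queue_len: i64 = con.rpush("spam_checks", job)?; // RPUSH returns the new list length
    Ok(())
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    enqueue_spam_check(&mut con, "b35f0ee6", "Hi, I want to enquire about ....")
}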
Field Validation
Next, let’s test the field validation; there should be little overhead, because we only check whether the given fields comply with the form rules (a toy sketch of the on_fail semantics follows the config):
[email]
name = "email"
label = "E-Mail"
field = "email"
placeholder = ""
helptext = ""
on_fail = "reject"
is_email = true
[name]
name = "name"
label = "Einzeiliger Text"
field = "text"
placeholder = ""
helptext = ""
on_fail = "pass"
[message]
name = "message"
label = "Absatztext"
field = "textarea"
placeholder = ""
helptext = ""
on_fail = "pass"
[settings]
discard_additional_fields = false
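As a rough illustration of the on_fail semantics above (not Formshive’s actual validator): "reject" drops the whole submission, while "pass" keeps it even if the field fails its check.

enum OnFail { Reject, Pass }

struct Rule { name: &'static str, is_email: bool, on_fail: OnFail }

// Deliberately crude email check - just enough to show the control flow.
fn validate(rules: &[Rule], value_of: impl Fn(&str) -> Option<String>) -> Result<(), String> {
    for rule in rules {
        let value = value_of(rule.name).unwrap_or_default();
        let ok = !rule.is_email || value.contains('@');
        if !ok {
            match rule.on_fail {
                OnFail::Reject => return Err(format!("field '{}' failed validation", rule.name)),
                OnFail::Pass => {} // keep the submission, just don't trust this field
            }
        }
    }
    Ok(())
}

fn main() {
    let rules = [
        Rule { name: "email", is_email: true, on_fail: OnFail::Reject },
        Rule { name: "name", is_email: false, on_fail: OnFail::Pass },
        Rule { name: "message", is_email: false, on_fail: OnFail::Pass },
    ];
    let result = validate(&rules, |field| match field {
        "email" => Some("your-email@gmail.com".to_string()),
        "name" => Some("Mike".to_string()),
        _ => None,
    });
    println!("{result:?}");
}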
Let’s see how it performs with 1000 requests and 10 concurrent connections:
$ ab -n 1000 -c 10 -p form-data.txt -T application/x-www-form-urlencoded http://localhost:8003/v1/digest/9454995b-27a0-43fb-a7c1-3adf0549d61c?redirect=none
Document Path: /v1/digest/9454995b-27a0-43fb-a7c1-3adf0549d61c?redirect=none
Document Length: 0 bytes
Concurrency Level: 10
Time taken for tests: 4.713 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 197000 bytes
Total body sent: 290000
HTML transferred: 0 bytes
Requests per second: 212.16 [#/sec] (mean)
Time per request: 47.135 [ms] (mean)
Time per request: 4.713 [ms] (mean, across all concurrent requests)
Transfer rate: 40.82 [Kbytes/sec] received
60.08 kb/s sent
100.90 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 5
Processing: 6 47 22.7 43 166
Waiting: 6 46 22.7 43 165
Total: 6 47 22.7 44 166
Percentage of the requests served within a certain time (ms)
50% 44
66% 54
75% 60
80% 64
90% 76
95% 87
98% 104
99% 117
100% 166 (longest request)
Nothing surprising here; this behaves much like the default form.
Scaling
At this point, I have a pretty good idea about Formshive’s bottlenecks. There are a number of things I didn’t test, but they likely don’t play a role here. For example:
- Captcha requests only hit the server if the captcha is set to ‘automatic’, or the user specifically clicks on it
- Integrations (email, webhooks, etc.) are added to a Redis queue, so they don’t block processing
- General usage: stats, listing forms, etc. are not critical. Most users receive form notifications by email, so they hardly ever log in. If they do, we’re talking X requests per minute times the number of logged-in users.
Here are all the individual parts that incur overhead on form submission:
- Spam check
- Maxmind lookup
- User agent parsing
- Field validation
- Captcha requests
- Integrations (email, webhooks, etc.)
- Subscription / balance checks
- Database
Low hanging fruit
Based on the above tests, the low-hanging fruit is the spam check; I could easily sustain 200 requests per second by queuing the spam checks in Redis and having a separate server process them (a worker sketch follows the server list below):
- 1x proxy server for SSL termination and rate limiting
- 1x application server
- 1x database server
- 1x proxy server, to delegate spam checks
- X spam check servers
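On the spam-check servers, a worker could be as simple as blocking on the queue. Again a sketch, assuming the redis crate and the same hypothetical queue as above:

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    loop {
        // BLPOP blocks until a job arrives and returns (queue name, payload).
        let (_queue, payload): (String, String) =
            redis::cmd("BLPOP").arg("spam_checks").arg(0).query(&mut con)?;
        let spam = looks_like_spam(&payload);
        println!("spam={spam} payload={payload}");
    }
}

// Stand-in for the Bayesian classifier; the real check is what burns the CPU.
fn looks_like_spam(message: &str) -> bool {
    message.to_lowercase().contains("viagra")
}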
More resilience, and double the throughput
The next step would be easy too; I could run the database on a separate server, and run two application servers behind a load balancer. I already use HAProxy for SSL termination, so this would be a minor change:
- 1x proxy server for SSL termination and rate limiting
- 2x application server
- 1x database server
- 1x proxy server, to delegate spam checks
- X spam check servers
With this setup, I would be able to sustain 400 requests per second (~34,560,000 requests per day) and tolerate a single server failure without downtime, albeit with a temporary performance hit.
Based on my experience and testing (postgresql-diesel-benchmark), PostgreSQL can easily handle millions of requests, but read performance will degrade, and I would have to look into indexing or partitioning, for example.
Enable Failover
Even though the previous setup easily allows us to scale to 600 or even 800 requests per second (~51,840,000 to 69,120,000 requests per day), it lacks redundancy; if HAProxy or PostgreSQL fails, the whole system is down. The next step would be to add a proxy server to handle the failover, and a second database server for redundancy.
- Hetzner Failover Subnet
- 2x proxy server for SSL termination and rate limiting
- 2x application server
- 2x PgBouncer instances, to handle database connections
- 1x proxy server, to handle database failover
- 3x etcd server, to handle database failover
- 2x database server
- 1x proxy server, to delegate spam checks
- X spam check servers
Proxy Server Failover
The failover between the proxy servers is by far the trickiest part; I’m currently on Hetzner, so it would be based on a Hetzner failover subnet that is switched over to the standby proxy, which takes 90 to 120 seconds.
Database Failover
For the database, Patroni is a nice choice; a minimal setup would include two PostgreSQL servers, one as the primary and one as the standby. Patroni will automatically promote the standby to primary in case of a failure. I’ve put together a minimal test here: PostgreSQL High Availability with Patroni, etcd, and HAProxy. My goal is always to keep everything as transparent as possible; that means minimal or no changes on the application side.
Granted, this setup includes two single points of failure, but there shouldn’t be any need to touch or change them:
- PgBouncer
- HAProxy
Also, this doesn’t easily scale to more PostgreSQL servers; in theory it’s possible to balance the load between the primary and the standby database, but this would require one of two things:
- either a load balancer that can differentiate between read and write requests, or
- modifications to the application code to separate read and write requests
These days I’m on Diesel ORM and deadpool, and this is not supported out of the box but would be possible to implement.
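A rough sketch of how read/write splitting could look with deadpool and Diesel - two pools, with callers opting into replica reads explicitly. This is an assumption about how I would wire it, not existing Formshive code:

use deadpool_diesel::postgres::{Manager, Pool};
use deadpool_diesel::Runtime;

// All writes go to the primary; reads may go to the standby and must accept
// slightly stale data.
pub struct SplitPools {
    primary: Pool,
    replica: Pool,
}

impl SplitPools {
    pub fn new(primary_url: &str, replica_url: &str) -> Self {
        let primary = Pool::builder(Manager::new(primary_url, Runtime::Tokio1))
            .max_size(50)
            .build()
            .expect("primary pool");
        let replica = Pool::builder(Manager::new(replica_url, Runtime::Tokio1))
            .max_size(50)
            .build()
            .expect("replica pool");
        Self { primary, replica }
    }

    pub fn writer(&self) -> &Pool { &self.primary }
    pub fn reader(&self) -> &Pool { &self.replica }
}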
Optimize for Latency
In order to optimize for latency, we would want the server as close as possible to the user; that means whatever we’ve done to enable failover should be done in every region we want to serve. The complexity here is that we may want to write to the database in both regions, but maintain a single source of truth.
Options:
- Write everything to one region, and read from the closest region
- Delegate writes based on user, or user organization, to a specific region
- Write to all regions, and handle conflict resolution with tools like BDR or pglogical
It might also be worth looking into YugabyteDB or CockroachDB for new projects.
Considerations
Legal
One consideration is your business case; US customers may prefer to keep their data in the US, while EU customers would probably prefer the EU. This doesn’t necessarily mean less complexity, because you may still want to do regional failover (e.g. Paris, Frankfurt, …).
For Formshive, I would want to allow customers to choose which region (US, EU, SEA) they prefer to have their data stored.
Latency within a region is not a problem for Formshive, so I don’t have to worry as much about enabling writes to different datacenters, and can focus on failover between datacenters rather than scaling across them.
There are many ways to determine where a customer is located, but it’s usually best to ask, or at least to make it obvious that there is a choice and what we’ve defaulted to, because users may not be in the region they appear to be in, or may prefer a different one.
For Formshive, one idea would be:
- Keep regional data (EU/US/SEA) isolated
- Maintain a global store with a hashed email / region mapping
When a user tries to log in, the region would first check its local store; if the user is not found, it would query the global store with the hashed email as part of a signed request (regional keypair). The global store would return a signed response (global keypair) with the region the user’s account lives in.
The connection between the regional and global stores would run over a WireGuard VPN.
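A rough sketch of the request/response signing, assuming ed25519-dalek (with the rand_core feature) and sha2 - crate choices and message format are mine, not a spec:

use ed25519_dalek::{Signer, SigningKey, Verifier};
use rand::rngs::OsRng;
use sha2::{Digest, Sha256};

fn main() {
    // In practice each regional store and the global store hold long-lived
    // keypairs; generating one inline keeps the example self-contained.
    let regional_key = SigningKey::generate(&mut OsRng);

    // Regional store: hash the email and sign the lookup request.
    let email_hash = hex::encode(Sha256::digest("user@example.com".to_lowercase()));
    let signature = regional_key.sign(email_hash.as_bytes());

    // Global store: verify the regional signature before answering; the response
    // ("this account lives in EU") would be signed the same way with the global keypair.
    let request_ok = regional_key
        .verifying_key()
        .verify(email_hash.as_bytes(), &signature)
        .is_ok();

    println!("request_ok={request_ok} hash={email_hash}");
}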
Technical
Shape of Data
The shape of your data can have an impact on how you handle failover; for append-only data like logs, billing records, or form submissions, conflicts should rarely happen. For Formshive form submissions specifically, only two changes are possible:
- A user marks a submission as spam
- A user deletes a submission
One can imagine a large organization with multiple users accessing one form - one user may mark a submission as spam, while another user deletes it. In this case, we would have to resolve the conflict and decide which change to keep.
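One possible rule, sketched below: the more destructive intent wins, so a deletion beats a spam flag. This is just an illustration of a deterministic merge rule, not something Formshive does today:

#[derive(Debug, Clone, Copy, PartialEq)]
enum Change { MarkedSpam, Deleted }

// Deletion wins over a spam flag, regardless of which region saw which change first.
fn resolve(a: Change, b: Change) -> Change {
    if a == Change::Deleted || b == Change::Deleted { Change::Deleted } else { Change::MarkedSpam }
}

fn main() {
    // One user flags the submission as spam in the EU, another deletes it in the US.
    let outcome = resolve(Change::MarkedSpam, Change::Deleted);
    assert_eq!(outcome, Change::Deleted);
    println!("resolved to {outcome:?}");
}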
Goals
What and how to scale also very much depends on your goals; if, for example, you wanted to run an unmetered proxy or VPN service, the tricky part is authentication and billing. Once a user completes authentication, we can issue a JWT that contains all the related information, such as account validity, and the proxy server can validate the authenticity of the JWT using the public key of the authentication server. That means it’s both easy and quick to add new servers and scale horizontally.
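A sketch of that check on the proxy side, assuming the jsonwebtoken crate and RS256; the claim names are made up:

use jsonwebtoken::{decode, Algorithm, DecodingKey, Validation};
use serde::Deserialize;

// The proxy needs nothing but the auth server's public key - no database call.
#[derive(Debug, Deserialize)]
struct Claims {
    sub: String, // account id
    exp: usize,  // expiry, doubling as account validity
}

fn account_is_valid(token: &str, auth_public_key_pem: &[u8]) -> bool {
    let key = match DecodingKey::from_rsa_pem(auth_public_key_pem) {
        Ok(key) => key,
        Err(_) => return false,
    };
    decode::<Claims>(token, &key, &Validation::new(Algorithm::RS256)).is_ok()
}

fn main() {
    // With a bogus token and key this prints "false"; a real proxy would load the
    // public key once at startup.
    println!("{}", account_is_valid("not.a.jwt", b"not a pem"));
}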
Many providers limit the number of concurrent connections, but this can be handled quite easily with a distributed database: every new connection is written to a global master, which replicates to the other regions, and the local VPN servers query their regional replica to enforce the connection limit. This keeps latency for connection requests low, and only requires a write operation when a new device connects.
Once the user has established a connection, there’s theoretically no need to query the connection limit again unless the user should have the ability to disconnect a device from their account.
Optimize for Traffic
In the next part of this journey, I want to look at how to optimize for both latency and massive amounts of traffic; this is where it gets interesting, because I would want to read from and write to the closest database without having to worry about which region the user is in. That’s where conflict handling (or avoidance) comes into play.
Coming soon