As of this writing, these details are accurate for Linux kernels from v2.2 onwards, and where possible this article highlights the differences in settings between kernel versions up through v5.5.
The kernel settings outlined in this article can be adjusted by following a primer on configuring kernel settings with sysctl.
TCP Receive Queue and netdev_max_backlog
Each CPU core can queue a number of packets in a ring buffer before the network stack is able to process them. If this buffer fills faster than the stack can drain it, incoming packets are dropped and a drop counter is incremented. The net.core.netdev_max_backlog setting should be increased to maximize the number of packets queued for processing on servers with high burst traffic. Note that net.core.netdev_max_backlog is a per-CPU-core setting.
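To see whether this queue is overflowing, check the per-CPU drop counter, which is the second hexadecimal column of /proc/net/softnet_stat, before raising the limit. The following sketch assumes GNU awk for the strtonum() hex conversion, and the output shown is illustrative:
$ sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000
$ awk '{ printf "cpu%d dropped=%d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat
cpu0 dropped=0
cpu1 dropped=127
# sysctl -w net.core.netdev_max_backlog=4096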
TCP Backlog Queue and tcp_max_syn_backlog
For each SYN packet picked up from the receive queue, a connection is created and placed in the SYN backlog queue. The connection is marked SYN_RECV and a SYN+ACK is sent back to the client. These connections are not moved to the accept queue until the corresponding ACK is received and processed. The maximum number of half-open connections in the queue is set with the net.ipv4.tcp_max_syn_backlog kernel setting.
Run the following netstat command to check the number of connections in the SYN backlog; the count should be no higher than 1 on a properly configured server under normal load, and should stay under the SYN backlog queue size under heavy load:
# netstat -an | grep SYN_RECV | wc -l
If there are a lot of connections in the SYN_RECV state, some additional settings can increase the time a half-open connection sits in this queue and cause problems on a high performance server.
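The backlog limit itself can be checked and raised with sysctl. The default shown here is illustrative, since it varies with available system memory:
$ sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 1024
# sysctl -w net.ipv4.tcp_max_syn_backlog=32768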
SYN Cookies
When the SYN backlog fills up, behavior depends on whether SYN cookies are enabled. If SYN cookies are not enabled, new SYN packets are dropped and the client will simply retry sending them. If SYN cookies are enabled (net.ipv4.tcp_syncookies), the connection is not created and is not placed in the SYN backlog, but a SYN+ACK packet is sent to the client as if it was, with the connection state encoded in the sequence number. SYN cookies may be beneficial under normal traffic, but during high volume burst traffic some connection details (such as TCP options negotiated in the handshake) will be lost and the client will experience issues when the connection is established.
There's a bit more to it than just the SYN cookies, but "SYN cookies ate my dog", a write-up by Graeme Cole, explains in detail why enabling SYN cookies on high performance servers can cause issues.
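Checking whether SYN cookies are enabled, and disabling them where the trade-off described above makes sense, is a one-line sysctl change. Keep in mind this removes a flood protection, so test it against your traffic profile:
$ sysctl net.ipv4.tcp_syncookies
net.ipv4.tcp_syncookies = 1
# sysctl -w net.ipv4.tcp_syncookies=0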
SYN+ACK Retries
What happens when a SYN+ACK is sent but never gets a response ACK packet? In this case, the network stack on the server will retry sending the SYN+ACK. The delay between attempts is calculated to allow for server recovery. If the server receives a SYN, sends a SYN+ACK, and does not receive an ACK, the retry delay follows an exponential backoff algorithm and therefore depends on the retry counter for that attempt. The kernel setting
that defines the number of SYN+ACK retries is net.ipv4.tcp_synack_retries with a default setting of 5. This will retry at the following intervals after the first attempt: 1s, 3s, 7s, 15s, 31s. The last retry will timeout after roughly 63s after the first attempt was made, which corresponds to when the next attempt would have been made if the number of retries was 6. This alone can keep a SYN packet in the SYN backlog for more than 60 seconds before the packet times out. If the SYN backlog queue is small, it doesn’t take a large volume of connections to cause an amplification event in the network stack where half-open connections never complete and no connections can be established. Set the number of SYN+ACK retries to 0 or 1 to avoid this behavior on high performance servers.
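As a sketch of the doubling rule above, assuming the default 1s initial retransmission timeout, the total time a half-open connection can sit in the SYN backlog for a given retry count works out to 2^(retries+1) - 1 seconds:
$ for r in 1 2 3 4 5 6; do echo "tcp_synack_retries=$r: $(( (1 << (r + 1)) - 1 ))s"; done
tcp_synack_retries=1: 3s
tcp_synack_retries=2: 7s
tcp_synack_retries=3: 15s
tcp_synack_retries=4: 31s
tcp_synack_retries=5: 63s
tcp_synack_retries=6: 127s
# sysctl -w net.ipv4.tcp_synack_retries=1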
SYN Retries
Although SYN retries refer to the number of times a client will retry sending a SYN while waiting for a SYN+ACK, it can also impact high performance servers that make proxy connections. An nginx server making a few dozen proxy connections to a backend server due to a spike of traffic can overload the backend server’s network stack for a short period, and retries can create an amplification on the backend on both the receive queue and the SYN backlog queue. This, in turn, can impact the client connections being served. The kernel setting for SYN retries is net.ipv4.tcp_syn_retries and defaults to 5 or 6 depending on distribution. Rather than retry for upwards of 63–130s, limit the number of SYN retries to 0 or 1.
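On the proxy host itself, the client-side counterpart is checked and lowered the same way (the default shown is illustrative, as it varies by distribution):
$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6
# sysctl -w net.ipv4.tcp_syn_retries=1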
See "Linux Kernel Tuning for High Performance Networking: Ephemeral Ports" on levelup.gitconnected.com for more information on addressing client connection issues on a reverse proxy server.
TCP Accept Queue and somaxconn
Applications are responsible for creating their accept queue when opening a listener port, by specifying a "backlog" parameter when calling listen(). As of Linux kernel v2.2, this parameter changed from setting the maximum number of incomplete connections a socket can hold to the maximum number of completed connections waiting to be accepted. As described above, the maximum number of incomplete connections is now set with the kernel setting net.ipv4.tcp_max_syn_backlog.
somaxconn and the TCP listen() backlog
Although the application is responsible for the accept queue size on each listener it opens, there is a limit to the number of connections that can be in a listener's accept queue. The size is controlled by two settings: 1) the backlog parameter on the TCP listen() call, and 2) a kernel-enforced maximum from the kernel setting net.core.somaxconn.
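The kernel side of the pair can be checked and raised with sysctl; the 65535 here matches the example configuration at the end of this article:
$ sysctl net.core.somaxconn
net.core.somaxconn = 128
# sysctl -w net.core.somaxconn=65535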
Accept Queue Default
The default value for net.core.somaxconn comes from the SOMAXCONN constant, which is set to 128 on Linux kernels up through v5.3; SOMAXCONN was raised to 4096 in v5.4. However, v5.4 is the most current version at the time of this writing and has not been widely adopted yet, so the accept queue will be truncated to 128 on many production systems that have not modified net.core.somaxconn.
Applications typically use the value of the SOMAXCONN constant when configuring the default backlog for a listener if it is not set in the application configuration, or it's sometimes simply hard-coded in the server software. Some applications set their own default; nginx, for example, sets it to 511, which is silently truncated to 128 on Linux kernels through v5.3. Check the application documentation for configuring the listener to see what is used.
Accept Queue Override
Many applications allow the accept queue size to be specified in the configuration by providing a “backlog” value on the listener directive or configuration that will be used when calling listen(). If an application calls listen() with a backlog value larger than net.core.somaxconn, then the backlog for that listener will be silently truncated to the somaxconn value.
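The effective, post-truncation backlog for a listener can be confirmed with ss: for sockets in the LISTEN state, Send-Q reports the configured backlog limit and Recv-Q the current accept queue depth. The output below is illustrative, showing an application backlog truncated to a somaxconn of 128:
$ ss -ltn 'sport = :80'
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:80           0.0.0.0:*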
Application Workers
If the accept queue is large, also consider increasing the number of threads that can handle accepting requests from the queue in the application. For example, setting a backlog of 20480 on an HTTP listener for a high volume nginx server without allowing for enough worker_connections to manage the queue will cause connection refused responses from the server.
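When the accept queue or the workers draining it are undersized, the kernel's listen overflow counters climb. They can be watched with netstat's statistics summary (the exact wording and counts vary by kernel and distribution):
$ netstat -s | grep -i listen
    1024 times the listen queue of a socket overflowed
    1024 SYNs to LISTEN sockets dropped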
Connections and File Descriptors
System Limit
Every socket connection also uses a file descriptor. The maximum number of all file descriptors that can be allocated to the system is set with the kernel setting fs.file-max. To see the current number of file descriptors in use, cat the following file:
# cat /proc/sys/fs/file-nr
1976 12 2048
The output shows the number of allocated file descriptors (1976), the number of allocated but unused file descriptors (12; on kernel v2.6 and later this second value is always 0, since allocated descriptors are always in use), and the system maximum (2048). On a high performance system, the maximum should be set high enough to handle the maximum number of connections and any other file descriptor needs for all processes on the system.
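To see how many descriptors a single process is holding, count its entries under /proc. The pid here is a placeholder:
# ls /proc/1234/fd | wc -l
214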
User Limit
In addition to the file descriptor system limit, each user is limited to a maximum number of open file descriptors. This is set with the system's limits.conf (nofile), or in the process's systemd unit file if running under systemd (LimitNOFILE). To see the maximum number of file descriptors a user can have open by default:
$ ulimit -n
1024
And under systemd, using nginx as an example:
$ systemctl show nginx | grep LimitNOFILE
LimitNOFILE=4096
Settings
To adjust the system limit, set the fs.file-max kernel setting to the maximum number of open file descriptors the system can have, plus some buffer. Example:
fs.file-max = 3261780
To adjust the user limit, set the value high enough to handle the number of connection sockets for all listeners plus any other file descriptor needs for the worker processes, and include some buffer. User limits are set under /etc/security/limits.conf, or a conf file under /etc/security/limits.d/, or in the systemd unit file for the service. Example:
# cat /etc/security/limits.d/nginx.conf
nginx soft nofile 64000
nginx hard nofile 64000
# cat /lib/systemd/system/nginx.service
[Unit]
Description=OpenResty Nginx - high performance web server
Documentation=https://www.nginx.org/en/docs/
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target
[Service]
Type=forking
LimitNOFILE=64000
PIDFile=/var/run/nginx.pid
ExecStart=/usr/local/openresty/nginx/sbin/nginx -c /usr/local/openresty/nginx/conf/nginx.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
[Install]
WantedBy=multi-user.target
Worker Limits
Like file descriptor limits, the number of workers, or threads, that a process can create is limited by both a kernel setting and user limits.
System Limit
Processes can spin up worker threads. The maximum number of all threads that can be created is set with the kernel setting kernel.threads-max. To see the max number of threads along with the current number of threads executing on a system, run the following:
# cat /proc/sys/kernel/threads-max
257083
# ps -eo nlwp | tail -n +2 | \
awk '{ num_threads += $1 } END { print num_threads }'
576
As long as the total number of threads is lower than the max, the server will be able to spin up new threads for processes as long as they’re within user limits.
User Limit
In addition to the system-wide thread limit, each user is limited to a maximum number of processes, and threads count against this limit. This is again set with the system's limits.conf (nproc), or in the process's systemd unit file if running under systemd (LimitNPROC). To see the maximum number of threads a user can spin up:
$ ulimit -u
4096
And under systemd, using nginx as an example:
$ systemctl show nginx | grep LimitNPROC
LimitNPROC=4096
Settings
In most systems, the system limit is already set high enough to handle the number of threads a high performance server needs. However, to adjust the system limit, set the kernel.threads-max kernel setting to the maximum number of threads the system needs, plus some buffer. Example:
kernel.threads-max = 3261780
To adjust the user limit, set the value high enough for the number of worker threads needed to handle the volume of traffic including some buffer. As with nofile, the nproc user limits are set under /etc/security/limits.conf, or a conf file under /etc/security/limits.d/, or in the systemd unit file for the service. Example, with nproc and nofile:
# cat /etc/security/limits.d/nginx.conf
nginx soft nofile 64000
nginx hard nofile 64000
nginx soft nproc 64000
nginx hard nproc 64000
# cat /lib/systemd/system/nginx.service
[Unit]
Description=OpenResty Nginx - high performance web server
Documentation=https://www.nginx.org/en/docs/
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target
[Service]
Type=forking
LimitNOFILE=64000
LimitNPROC=64000
PIDFile=/var/run/nginx.pid
ExecStart=/usr/local/openresty/nginx/sbin/nginx -c /usr/local/openresty/nginx/conf/nginx.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
[Install]
WantedBy=multi-user.target
TCP Reverse Proxy Connections in TIME_WAIT
Under high volume burst traffic, proxy connections stuck in TIME_WAIT can add up, tying up many resources during the connection close handshake. Before reaching TIME_WAIT, the side that closed first sits in FIN_WAIT_2 waiting for a final FIN packet from the peer (here, the upstream worker) that may never come. In most cases this is normal and expected behavior, and the default timeout of 60s is acceptable. However, when the volume of connections in these closing states is high, it can cause the application to run out of worker threads to handle new requests, or of client sockets to connect to the upstream. It's better to let them time out faster. The kernel setting that controls how long an orphaned connection waits for that final FIN is net.ipv4.tcp_fin_timeout, and a good setting for a high performance server is between 5 and 7 seconds.
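The volume of connections in this state can be counted the same way as SYN_RECV earlier in this article, and the timeout lowered with sysctl (the count shown is illustrative):
# netstat -an | grep TIME_WAIT | wc -l
8192
# sysctl -w net.ipv4.tcp_fin_timeout=5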
Bringing it All Together
The receive queue should be sized to handle as many packets as Linux can process off of the NIC without dropping any, including a small buffer in case spikes are a bit higher than expected. The softnet_stat file should be monitored for dropped packets to discover the correct value. A good rule of thumb is to match the value set for tcp_max_syn_backlog, allowing for at least as many SYN packets as can be processed to create half-open connections. Remember, this is the number of packets each CPU core can have in its receive buffer, so divide the total desired by the number of cores to be conservative.
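For example, dividing the SYN backlog target used in the closing configuration (32768) across the example 8-core server gives the per-core receive queue value used there:
$ nproc
8
$ echo $(( 32768 / 8 ))
4096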
The SYN backlog queue should be sized to allow for a large number of half-open connections on a high performance server to handle bursts of occasional spike traffic. A good rule of thumb is to set this at least to the highest number of established connections a listener can have in the accept queue, but no higher than twice the number of established connections a listener can have. It is also recommended to turn off SYN cookie protection on these systems to avoid data loss on high burst initial connections from legitimate clients.
The accept queue should be sized to allow for holding a volume of established connections waiting to be processed as a temporary buffer during periods of high burst traffic. A good rule of thumb is to set this between 20–25% of the number of worker threads.
Configurations
The following kernel settings were discussed in this article:
# /etc/sysctl.d/00-network.conf
# Receive Queue Size per CPU Core, number of packets
# Example server: 8 cores
net.core.netdev_max_backlog = 4096
# SYN Backlog Queue, number of half-open connections
net.ipv4.tcp_max_syn_backlog = 32768
# Accept Queue Limit, maximum number of established
# connections waiting for accept() per listener.
net.core.somaxconn = 65535
# Maximum number of SYN and SYN+ACK retries before
# packet expires.
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_synack_retries = 1
# Timeout in seconds to close client connections in
# TIME_WAIT after waiting for FIN packet.
net.ipv4.tcp_fin_timeout = 5
# Disable SYN cookie flood protection
net.ipv4.tcp_syncookies = 0
# Maximum number of threads system can have, total.
# Commented, may not be needed. See user limits.
#kernel.threads-max = 3261780
# Maximum number of file descriptors system can have, total.
# Commented, may not be needed. See user limits.
#fs.file-max = 3261780
The following user limit settings were discussed in this article:
# /etc/security/limits.d/nginx.conf
nginx soft nofile 64000
nginx hard nofile 64000
nginx soft nproc 64000
nginx hard nproc 64000
Conclusion
The settings in this article are meant as examples and should not be copied directly into your production server configuration without testing. There are also some additional kernel settings that have an impact on network stack performance. Overall, these are the most important settings that I’ve used when tuning the kernel for high performance connections.