KPIs for server performance 101

Businesses run on numbers, and increasingly on servers. In the Zero Trust era of IT, even servers have to be monitored for their numbers, in other words, their KPIs. If your servers, both on-premises and in the cloud, have been serving you well, it might seem like you don't have to quantify their performance and hold it against a benchmark.

If this thought has ever crossed your mind, the first question to ask yourself is: Are the servers ready to handle the growth of the business? This may not seem like a problem at first glance, but appearances are deceptive. Ask these questions to know if your IT infrastructure is really up to the task.

  • Are the servers either under-utilized or pushed to their limits?
  • Is the commissioned CPU and disk capacity enough to handle the present and future load?
  • Are the processes running on the servers healthy?
  • Does the network throughput look suspicious?
  • How much of a memory usage spike is harmless?

KPIs are the answers to these questions. Now, let's look at some objective answers to the question: What should you monitor in your servers? If you have been hearing the clichéd answer of "it depends" to that question, this article is not going to say the same.

Uptime as a KPI

Uptime, or availability, as a KPI is just one piece of the puzzle. It's fairly simple to detect a server going down; one of the worst-case scenarios is learning about it from a customer's harshly worded email. However, if you set your KPIs right, you will see an outage coming before it happens. Near-100% availability or uptime is one important KPI. Now, let's see how to set KPIs to achieve 99.999% (or even better) availability.
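
To put a target like 99.999% in perspective, an availability percentage translates directly into a downtime budget; five nines leaves only about five minutes of downtime per year. Here is a minimal Python sketch of that arithmetic (the helper function is ours, purely for illustration):

    def downtime_budget_minutes(availability_pct, period_days=365):
        # Total minutes in the period, multiplied by the allowed unavailability fraction.
        total_minutes = period_days * 24 * 60
        return total_minutes * (1 - availability_pct / 100)

    for target in (99.9, 99.99, 99.999):
        print(f"{target}% availability allows ~{downtime_budget_minutes(target):.1f} minutes of downtime per year")

The budget shrinks by a factor of ten with each extra nine, which is why the KPIs below focus on catching trouble before it turns into downtime.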

CPU utilization

Setting a CPU utilization KPI is tricky. CPU utilization is a fundamental metric that shows how much of your server's processing power is in use. If this metric is too high, it could mean that your server is being used to its fullest potential, or that it is heading toward a bottleneck. As a rule of thumb, an average CPU utilization of up to 75% is considered safe. In virtual environments, however, running your servers at an average utilization of 75% for prolonged periods can result in throttling if your VMs share physical resources with others.

If your servers run high-performance computing (HPC) applications, chances are they are already optimized and designed to handle high CPU utilization for prolonged periods. For these servers, utilization of up to 90% is acceptable. The risk with such a high KPI is that even a temporary spike may have drastic effects, such as unresponsiveness.

For servers running business-critical processes, keeping CPU utilization below 50% is considered a safe limit. That way, in the event of a failover, the backup server can handle the full load without much disruption.

If your budget allows and the workload demands it, auto-scaling in cloud environments can help keep CPU utilization in check by creating additional instances when utilization breaches safe thresholds.

Another KPI to consider is the processor queue length (Windows) or the load average (Linux): the number of tasks waiting for the processor. A high number here means your processor is overloaded with work, and your application is left waiting for CPU time while the server finishes processing other tasks.
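
As a rough illustration of how these two CPU KPIs could be checked, here is a minimal Python sketch using the psutil library; the 75% threshold mirrors the guideline above, and comparing the load average against the core count is a common rule of thumb, not a universal standard:

    import psutil

    CPU_THRESHOLD_PCT = 75  # assumed safe limit, per the guideline above

    # Average CPU utilization sampled over a one-second window.
    cpu_pct = psutil.cpu_percent(interval=1)

    # 1-minute load average (native on Linux; psutil emulates it on Windows).
    load_1min, _, _ = psutil.getloadavg()
    core_count = psutil.cpu_count(logical=True)

    if cpu_pct > CPU_THRESHOLD_PCT:
        print(f"CPU utilization {cpu_pct:.1f}% breaches the {CPU_THRESHOLD_PCT}% KPI")
    if load_1min > core_count:
        print(f"Load average {load_1min:.2f} exceeds the {core_count} available cores")

In practice, a monitoring agent would sample these values continuously and alert only when a threshold is breached for several consecutive intervals.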

Memory utilization

RAM, or memory, utilization is another critical KPI to be monitored. Take database servers as an example: they require enough memory to process large datasets efficiently. When available memory runs low, the infamous swapping occurs: the operating system starts moving memory pages out to much slower disk space, resulting in poor performance. To know if your servers are becoming dependent on swap, set swap memory utilization as one more KPI.

As a guideline for on-premises servers and VMs with no auto-scale option, a memory utilization of 80% is considered safe. This limit keeps your servers well utilized and still provides a cushion for unexpected memory usage spikes.

For memory-hungry applications like in-memory caching systems or large databases, a higher threshold is acceptable, but the higher you set it, the more closely you need to monitor it. For critical systems, a lower memory utilization threshold of 60% is recommended so that, in case of a failover, the backup server can take up the load.
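
Here is a minimal psutil-based sketch of how memory and swap utilization could be sampled; the 80% memory threshold follows the guideline above, while the swap threshold is an assumption for illustration:

    import psutil

    MEM_THRESHOLD_PCT = 80   # assumed limit for servers with no auto-scale option
    SWAP_THRESHOLD_PCT = 10  # assumption: sustained swap usage of any size is worth a look

    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()

    if mem.percent > MEM_THRESHOLD_PCT:
        print(f"Memory utilization {mem.percent:.1f}% breaches the {MEM_THRESHOLD_PCT}% KPI")
    if swap.total and swap.percent > SWAP_THRESHOLD_PCT:
        print(f"Swap utilization {swap.percent:.1f}% suggests the server is leaning on disk-backed memory")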

Disk I/O and utilization

Disk I/O is essential for all servers, and more so for applications that depend on read-write operations. There are three key KPIs that can show the present and future performance of the disk:

  • Available disk space: Self-explanatory.
  • Disk queue length: The number of disk requests waiting to be processed.
  • Disk IOPS: The number of input/output operations the disk processes per second.

The industry standard is to keep the disk queue length below 0.5 per disk spindle. If your disk queue length stays consistently higher than that, it means your disk subsystem is struggling to keep up with the input and output requests. This leads to slow response times and degraded performance, which no business wants.

If your disks are cloud-based, sustained high disk IOPS can lead to throttling, as most cloud providers limit performance to prevent overuse. High disk IOPS impacts costs along with performance.
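
Here is a minimal sketch of how disk space and IOPS could be sampled with psutil; the free-space threshold, the root-volume path, and the one-second sampling window are assumptions for illustration:

    import time
    import psutil

    DISK_USAGE_THRESHOLD_PCT = 85  # assumed disk space KPI

    # Available disk space on the root volume.
    usage = psutil.disk_usage("/")
    if usage.percent > DISK_USAGE_THRESHOLD_PCT:
        print(f"Disk usage {usage.percent:.1f}% breaches the {DISK_USAGE_THRESHOLD_PCT}% KPI")

    # Approximate IOPS: I/O operations completed over a one-second window.
    before = psutil.disk_io_counters()
    time.sleep(1)
    after = psutil.disk_io_counters()
    iops = (after.read_count - before.read_count) + (after.write_count - before.write_count)
    print(f"Approximate disk IOPS over the last second: {iops}")

Disk queue length is exposed differently on each platform (on Windows, for example, it is a performance counter), so it is not covered by this sketch.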

Network throughput

Whatever name you know it by, network throughput is the rate at which data is transferred over a network. The safe level of network throughput utilization depends on the server's role and on the capacity and infrastructure of the network. If your business KPIs are impacted by even a few milliseconds of network latency, as in trading, then related network KPIs such as latency, jitter, and packet loss need to be monitored very closely.

A generic guideline is to keep network utilization below the 80% mark of the total bandwidth to allow some clearance for sudden bursts of traffic. If you hit this network throughput KPI consistently, consider upgrading to NICs (network interface controllers) with higher bandwidth and to better switches.

In cloud servers and content delivery networks (CDNs), network bandwidth is often limited based on instance size and type. Safe utilization here means a level that does not trigger the provider's throttling mechanisms.

In addition to network throughput, expand your monitoring spotlight to latency, jitter, and packet loss if your application or product is significantly influenced by the network. As a bonus, monitoring your network KPIs can also help you detect anomalies such as DDoS attacks and crypto mining.

What are the KPIs to determine the server's network throughput?

  • Packets sent and received
  • Data sent and received
  • Network interface availability
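
As a rough idea of how the counters above can be sampled, here is a minimal Python sketch using psutil; the one-second window is an assumption for illustration:

    import time
    import psutil

    # Packets and bytes sent/received over a one-second window.
    before = psutil.net_io_counters()
    time.sleep(1)
    after = psutil.net_io_counters()

    print(f"Packets sent/received per second: {after.packets_sent - before.packets_sent} / "
          f"{after.packets_recv - before.packets_recv}")
    print(f"Data sent/received per second: {after.bytes_sent - before.bytes_sent} / "
          f"{after.bytes_recv - before.bytes_recv} bytes")

    # Network interface availability: which interfaces are currently up.
    for name, stats in psutil.net_if_stats().items():
        print(f"Interface {name}: {'up' if stats.isup else 'down'}")

Comparing the measured byte rate against an interface's rated speed gives the utilization figure to hold against the 80% guideline mentioned earlier.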

What did we learn today?

Looking at all these KPIs and specific metrics, some of them might be of more interest to you than others. This is expected and encouraged: it indicates that your application or product is impacted more by those particular KPIs than by the others. Every server serves a purpose, namely the application it runs. Hence, holding all the servers in your organization to the same KPIs might not be fruitful in the long run.

Monitor the server's performance against its purpose. For example, if a server's primary purpose is to run a process, then the two very important questions we should ask before setting the KPIs are:

  • What are all the KPIs that will affect the server's ability to run the process?
  • How will we know whether the process is available?

These questions can be tweaked according to the server's purpose, be it a database server, an app server, a caching server, or a file transfer server.

These are general guidelines to point you in the right direction in optimizing your servers' performance. Quantifying your server performance with the help of KPIs will help you identify the bottlenecks at hand, anticipate future ones, and find avenues to improve your IT infrastructure as a whole.
