MTTF and Its Value to Technical Teams
Keeping systems reliable at scale is a complex problem. Building and operating digital services that keep customers happy is now a priority for businesses in just about every industry. When an outage does occur, organizations rely on their DevOps and site reliability engineering (SRE) teams to navigate and solve the problem. These teams are responsible for restoring the organization’s applications quickly and for finding ways to keep apps operating despite the outage.
One important metric these teams monitor is mean time to failure (MTTF). This metric measures the average time until a system or application fails and cannot be repaired. It helps organizations understand how long a technology product will typically last.
Generally, technical teams aim for an extended MTTF. The longer the MTTF, the less time teams spend fixing issues, the less money the organization spends replacing physical components, and the happier customers are as services remain up and running.
Before you improve your MTTF, you need to know how to calculate this valuable metric and understand its causes and consequences. Let’s dig deeper into understanding MTTF. Then, we’ll show you how to optimize your MTTF using monitoring and automation.
What Is MTTF?
MTTF is an incident management metric that enables businesses to measure the average time it takes for a system or application to fail the first time — and typically the only time, as MTTF is often used for systems that must be replaced upon failure. Companies use MTTF for unrepairable outages as well as system or workload component defects.
MTTF typically measures the availability and reliability of IT environments in DevOps and SRE practices. It measures the overall quality of the workloads and platform services, not the quality of the DevOps or SRE team managing these systems.
Calculate MTTF using this formula:
MTTF = Total Hours of Operation Time / Total Number of Items
For example, say you run an application workload that relies on a cluster of four firewall appliances. The first appliance fails after 12,000 hours, the second fails after 14,300 hours, and the third and fourth both fail after 17,450 hours. Each failure is so destructive it makes the machine unrepairable.
MTTF = (12,000 + 14,300 + 17,450 + 17,450) hours / 4 devices
= 61,200 hours / 4 devices
= 15,300 hours per device
Based on this calculation, you can assume that every 15,300 hours, an appliance fails. While you can’t reasonably expect each appliance to fail precisely every 15,300 hours, this calculation provides a useful estimation.
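The formula above is simple enough to sketch in a few lines of code. This is just an illustration of the arithmetic, with the four appliance lifetimes from the example hard-coded; the function name is ours:

```python
def mttf(failure_hours):
    """Mean time to failure: total operating hours divided by the number of failed items."""
    return sum(failure_hours) / len(failure_hours)

# Lifetimes of the four firewall appliances from the example above.
appliance_lifetimes = [12_000, 14_300, 17_450, 17_450]
print(mttf(appliance_lifetimes))  # 15300.0 hours per device
```

The same function works for any population of unrepairable components, as long as every item in the list has actually run to failure.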
Although there are similarities between MTTF and mean time between failures (MTBF), a key difference is that MTBF is typically used for incidents or outages that can be fixed by repairing or replacing a component. With MTTF, the only way to fix the problem is by replacing the entire system.
In our example, we considered the MTTF for an appliance. In the real world, MTTF typically expresses the lifetime of an individual physical component. This could be a mainboard, memory module, CPU, or storage disk. However, it could also refer to an appliance, such as an Internet of Things (IoT) device, network switch, or firewall. Such a device can be almost like a black box, where you can’t repair a single component to restore the device’s function.
A Carnegie Mellon University study identified the relationship between disk failures and other critical components in different system architectures, including:
- A high-performance compute (HPC) cluster with 765 nodes, monitored over five years, with about 30.6 percent hard drive outages
- Two compute clusters used by an Internet service provider (ISP), where one cluster had an average hard drive outage of 18.1 percent and the second had an average close to 49.1 percent
For the high-performance cluster, memory had about the same failure rate as hard drives, at 28.5 percent. The first ISP cluster had around 20 percent memory failure, but the second cluster only had 3.4 percent.
The second cluster had the largest disk MTTF. This could be because it is a storage-intensive system that relies less on memory. In the HPC cluster, the main components, storage and memory, failed at about the same rate.
This kind of MTTF data is highly valuable, especially for server vendors. The higher the percentage, the more fault-prone the disks (and other components), which means vendors must provide more replacement parts.
Even with a fault-tolerant disk subsystem, every disk replacement operation can cause other system outages. On the other hand, an MTTF of one million hours reflects a little more than 41,666 days — over 100 years.
Data center operations teams see enough examples of disk failures to know the larger numbers don’t always reflect real-life situations. For example, disk lifespans are typically only three to five years, or sometimes seven to ten years for solid-state drives.
Assume a hardware vendor performs a disk drive lifetime test on 1,000 drives over one month (720 hours). During this time, three disks fail. Using the formula (720 x 1,000 / 3), we get an MTTF of 240,000 hours.
Again, this calculation doesn’t mean a drive will fail every 240,000 hours. A better way to express the potential drive failure is that for every 333 disks, one disk will fail every month. That is 0.3 percent per month, or 3.6 percent per year. So, if we have 1,000 drives, 36 will fail during a single year. This framing is far more practical than quoting the theoretical (yet statistically correct) per-drive lifespan of 240,000 hours, or roughly 27 years.
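The vendor-test arithmetic can be sketched the same way. The function names are ours, and the annualization follows the 720-hour months used in the example above:

```python
def mttf_from_test(num_drives, test_hours, failures):
    # Total drive-hours of operation divided by the number of observed failures.
    return num_drives * test_hours / failures

def annual_failure_rate(mttf_hours, hours_per_month=720):
    # Fraction of the population expected to fail in a year of operation.
    return 12 * hours_per_month / mttf_hours

drive_mttf = mttf_from_test(1_000, 720, 3)       # 240000.0 hours
afr = annual_failure_rate(drive_mttf)            # 0.036, i.e. 3.6 percent per year
print(round(afr * 1_000))                        # 36 of 1,000 drives per year
```

Expressing the result as an annual failure rate across the fleet, rather than a per-drive lifespan, is what makes the number actionable for capacity and spare-parts planning.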
Let’s now drill down into some of the causes and consequences of a low or high MTTF.
Causes and Consequences
Any application workload runs on a system. This is true whether it’s a complete multi-tiered application on a physical server or a collection of microservices running as containerized workloads.
Even public cloud environments like Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP) are nothing more than a collection of physical data center equipment. Like any other physical equipment, a public cloud environment can and will fail. As a result, these data center solutions also face maintenance, both planned and unplanned.
Any such maintenance can impact application and service delivery, reliability, revenue, customer satisfaction, and employee satisfaction. Having an accurate estimate of MTTF helps dramatically improve service reliability.
If you know when to expect a limited-life resource to fail, you can replace it before failure. Replacing resources requires having access to spare parts or full spare appliances, so you need to be prepared in advance.
On the other hand, inaccurately measuring MTTF can cause a whole series of related problems:
- Failures occur unexpectedly, causing downtime and data loss.
- Customers become unhappy about downtime and take their business elsewhere, causing revenue to decline.
- Employee morale suffers because teams constantly fix preventable emergencies.
- You replace hardware more often than necessary, increasing operating costs.
MTTF depends largely on the equipment’s overall condition and circumstances — in other words, how and where equipment runs. Running equipment in unsuitable conditions is one of the leading root causes of frequent service failures.
Let’s imagine two identical physical servers. One is running in a hot, humid location, such as in a beachside building in the tropics. The other is in a climate-controlled server room in a business park in New York. The server in the tropics is much more vulnerable to failure, so we can’t assume the same MTTF for both servers.
Access to accurate information can help you more accurately measure your MTTF and optimize this metric for more reliable systems and happier customers.
How to Improve MTTF
To ensure we can drive to work in the morning, we give our cars regular check-ups. We might check tire pressure before a long ride, watch the coolant temperature during hot summer days, or keep an eye on the air conditioning refrigerant. Just as proactive measures make us responsible drivers and car owners, DevOps and SRE teams can improve application and infrastructure MTTF by being proactive.
Monitoring is the key to improving MTTF. Monitoring helps to ensure that when something goes wrong, you have the data necessary to quickly identify and resolve the issue. Metrics, logs, and distributed tracing give you a solid foundation for resolving equipment and application failures. By adopting these into your monitoring practice, your team can more efficiently and immediately determine the problem’s root cause and plan a course of action from there.
Automated Incident Detection, Response, and Resolution
When you’ve determined the root cause and established a solution, you’ll want to use workflow automations to speed the resolution process. Solutions like xMatters Flow Designer help your team respond quickly. Each automation can tackle the full flow’s complexity, ranging from detecting the incident to providing data to help identify a response and establish a resolution. Automation enables DevOps and SRE teams to begin fixing the issue immediately. It reduces MTTR and optimizes system and application reliability.
A solution like xMatters helps you get a better view of your MTTR while detecting, handling, and mitigating issues. When equipment or parts fail, automated workflows can inform the response teams, integrate the ordering of spare parts or a replacement unit, document as much as possible, and help perform postmortems.
Rely on Information
Documentation is critical to understanding and preventing these kinds of failures in the future. You’ll want to document the steps your team took to find and enact the solution, as well as the possible root cause.
During your postmortem, you should also consider what steps your team took to contribute to a shorter resolution time (MTTR). Finally, you’ll want to identify what process was responsible for the outage and who took ownership of the resolution. These actions will help prevent this outage from occurring again in the future.
Don’t forget MTTF mainly reflects failures due to components that you can’t repair and need to replace, so work toward a planned replacement schedule. For example, you can replace a disk in your data center every month, on a rotational basis, as a proactive measure.
To reduce the impact of failures, consider optimizing the hardware’s redundancy. For example, redundant disk configurations (like redundant array of independent disks (RAID) sets or redundant volumes), multiple network interface controllers (NICs) in a server, redundant switches, highly-available firewall appliances, and more all help avoid or minimize outages caused by component failures.
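As a simplified illustration of why redundancy helps, the sketch below assumes component failures are independent, which real-world correlated failures (shared power, heat, firmware) often violate, so treat it as an optimistic lower bound. The 3.6 percent figure reuses the annual failure rate from the drive example above:

```python
def outage_probability(p_fail, redundancy):
    """Chance that every redundant copy of a component fails in the same period,
    assuming failures are independent (an optimistic simplification)."""
    return p_fail ** redundancy

single_disk = outage_probability(0.036, 1)  # 3.6% annual chance of losing a lone disk
mirrored = outage_probability(0.036, 2)     # roughly 0.13% for a two-disk mirror
print(single_disk, mirrored)
```

In practice you would scale the failure probability to the rebuild window rather than a full year, since a mirror is only vulnerable while its partner is being replaced, but the shape of the argument is the same: each redundant copy multiplies the failure probabilities together.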
Be sure to understand the value of an accurate MTTF to avoid or minimize outages, but don’t look at it as an individual metric. MTTF alone isn’t as helpful to organizations as it is when combined with mean time between failures (MTBF) and mean time to repair (MTTR). An entire warehouse of spare parts is useless if no response team is available to replace the broken components quickly. Sometimes, replacing the failed components is not enough to get the systems and applications back to a fully working, healthy state. Combining MTTF with MTTR lets you replace parts before they fail and restore service quickly when they do.
Now that you know more about MTTF, you can explore ways to improve this metric in your own systems. To learn more about the importance of monitoring and how workflow automation helps extend your MTTF, sign up for an xMatters demo.