Even the most efficient maintenance teams experience equipment failures. There are different types of failure, but in the most basic terms, failure means that a system, component, or device no longer produces the desired results.
Managing failure is essential to reducing its negative impact on the business. A few metrics can help you understand the types of failure and sort them efficiently.
The relevance of reliable data
To reduce equipment failures, it is crucial to collect accurate and meaningful data. These are some of the inputs that should be collected as part of your maintenance history:
- Labor hours spent on maintenance.
- Number of breakdowns.
- Operational time.
The data collection process can be tedious, but it is essential to improving operations. It can be painfully time-consuming when done manually, but a mobile CMMS makes it much simpler.
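As a rough illustration, the inputs listed above could be captured in a small record per asset and reporting period. This is only a sketch: the class name, field names, and the sample asset are assumptions made for the example, not the schema of any particular CMMS.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceRecord:
    """One asset's maintenance history for a reporting period (hypothetical schema)."""

    asset_id: str
    labor_hours: float        # labor hours spent on maintenance
    breakdowns: int           # number of breakdowns
    operational_hours: float  # operational time

    def __post_init__(self) -> None:
        # Reject obviously bad input early, because later metrics depend on it.
        if self.labor_hours < 0 or self.operational_hours < 0:
            raise ValueError("hours must be non-negative")
        if self.breakdowns < 0:
            raise ValueError("breakdowns must be non-negative")


# Illustrative values only.
record = MaintenanceRecord("press-01", labor_hours=50.0, breakdowns=7, operational_hours=4000.0)
```

Validating the values when the record is created keeps bad entries from silently distorting the metrics calculated later.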
What is MTTR?
Mean time to repair (MTTR) is a metric used by maintenance departments to measure the average time needed to fix a failure. The MTTR calculation covers the period from the beginning of the incident to the moment the equipment or system goes back into production. This includes the time spent:
- Diagnosing the problem.
- Notifying maintenance technicians.
- Fixing the issue.
- Reassembling and validating equipment.
- Resetting, testing, and starting up the equipment.
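Because the clock runs from the start of the incident until the asset is back in production, the duration of a single repair can be derived from two timestamps. The sketch below assumes both timestamps are recorded for every incident; the function name and the sample times are hypothetical.

```python
from datetime import datetime


def repair_duration_hours(incident_start: datetime, back_in_production: datetime) -> float:
    """Hours from the start of the incident until the asset is producing again."""
    if back_in_production < incident_start:
        raise ValueError("back_in_production must not precede incident_start")
    return (back_in_production - incident_start).total_seconds() / 3600


# Hypothetical incident: failure at 08:00, asset back in production at 14:30 the same day.
hours = repair_duration_hours(datetime(2024, 3, 4, 8, 0), datetime(2024, 3, 4, 14, 30))
print(hours)  # 6.5
```

The single figure covers diagnosis, notification, the fix itself, reassembly, and start-up, since all of them fall between the two timestamps.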
How to calculate MTTR?
To calculate the MTTR, divide the total unplanned maintenance time spent on an asset by the total number of failures that asset experienced over a specific period. Mean time to repair is usually expressed in hours. For example, if you have spent 50 hours on unplanned maintenance for an asset that has broken down seven times over the course of a year, the mean time to repair is 50 / 7, or about 7.14 hours. The MTTR depends on different factors, such as the type of asset, its age, and the type of failure. As a general rule, however, a good MTTR should be under five hours.
It is important to note that the total corrective maintenance time (and, consequently, the average repair time) can include the time from the moment a fault is detected to the moment the repair work actually starts (including the time needed for identification, notification, diagnosis of the failure, etc.).
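The calculation described in this section can be sketched in a few lines. The function name is an assumption made for the example; the numbers are the ones from the worked example (50 hours, seven failures).

```python
def mttr_hours(total_unplanned_hours: float, failures: int) -> float:
    """Mean time to repair: total unplanned maintenance time divided by number of failures."""
    if failures <= 0:
        raise ValueError("MTTR is undefined without at least one failure")
    return total_unplanned_hours / failures


mttr = mttr_hours(50, 7)
print(round(mttr, 2))  # 7.14
```

The guard for zero failures matters in practice: an asset that never failed in the period has no MTTR, rather than an MTTR of zero.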
How to use MTTR?
Mean time to repair is used as a benchmark for increasing efficiency, limiting unplanned downtime, and boosting the bottom line. MTTR helps organizations understand why maintenance may be taking longer than usual and keep everybody informed about failures, causes, and effective solutions.
MTTR analysis provides useful insights into how to handle and schedule maintenance tasks, how to reduce missed orders, and how to improve customer service.
Even though MTTR is a reactive maintenance metric, tracking it gives you a look into how efficient your preventive maintenance programs are. For example, equipment with a lengthy repair time might have underlying root causes that contribute to the failure. MTTR can also help you start investigating the main cause of failures and get you on your way to finding a solution.
MTTR analysis is also helpful for making decisions such as whether to repair or replace an asset. If a piece of equipment takes longer to repair as it gets older, replacing it might be the better option. MTTR history can also be used to predict the lifecycle costs of new equipment or systems.
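One simple way to spot an asset that takes longer to repair as it ages is to compute its MTTR per year and check whether it keeps climbing. The sketch below is illustrative only: the function names, the "rising every year" rule, and the repair durations are assumptions, and a real repair-or-replace decision would also weigh cost and criticality.

```python
def yearly_mttr(repairs_by_year: dict[int, list[float]]) -> dict[int, float]:
    """Average repair time per year, from a mapping of year -> repair durations in hours."""
    return {
        year: sum(durations) / len(durations)
        for year, durations in sorted(repairs_by_year.items())
        if durations  # skip years with no recorded failures
    }


def mttr_is_rising(mttr_by_year: dict[int, float]) -> bool:
    """True when MTTR increased in every consecutive year on record."""
    values = [mttr_by_year[year] for year in sorted(mttr_by_year)]
    return len(values) >= 2 and all(a < b for a, b in zip(values, values[1:]))


# Hypothetical repair durations (hours) for one asset.
history = {2021: [3.0, 4.0], 2022: [5.0, 6.0, 4.0], 2023: [8.0, 7.0]}
trend = yearly_mttr(history)
print(trend)                  # {2021: 3.5, 2022: 5.0, 2023: 7.5}
print(mttr_is_rising(trend))  # True
```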
How to reduce the MTTR?
To reduce the MTTR, you need to evaluate and try to reduce both factors: the total corrective maintenance time and the number of repairs. Mathematically, reducing the number of repairs will not reduce the MTTR if the total corrective maintenance time remains the same. In practice, however, fewer repairs lead to less total maintenance time.
These are some good practices to help teams reduce MTTR and improve their incident response:
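The arithmetic behind this point is easy to check. The sketch below reuses the earlier figure of 50 hours over seven repairs; the other numbers are hypothetical and only illustrate the direction of the effect.

```python
def mttr_hours(total_hours: float, repairs: int) -> float:
    """Total corrective maintenance time divided by the number of repairs."""
    return total_hours / repairs


baseline = mttr_hours(50, 7)          # about 7.14 hours

# Fewer repairs but the same total time: the average per repair goes up.
fewer_same_time = mttr_hours(50, 5)   # 10.0 hours

# Fewer repairs that also bring the total time down (assumed here to be 30 hours).
fewer_less_time = mttr_hours(30, 5)   # 6.0 hours

print(round(baseline, 2), fewer_same_time, fewer_less_time)  # 7.14 10.0 6.0
```

In the last case the MTTR improves only modestly, but total time lost to repairs falls from 50 to 30 hours, which is why both factors are worth tracking.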
Create a robust incident-management action plan
Teams need a clear policy that explains what to do if something breaks: whom to call, how to document what is happening, and how to sort problems. Some companies assign semi-permanent incident commander roles, while others rotate staff members into the role.
Train your team
Cross-training will help you avoid one of the most dangerous incident-response risks: situations in which one person is the only one able to manage a certain type of system or technology. If that person goes on holiday or leaves the organization, nobody else will have the skills or knowledge to fix the problems.
Leverage AIOps capabilities to detect and resolve incidents faster
AIOps (Artificial Intelligence for IT Operations) complements your monitoring practices by providing an intelligent feed of incident information. If you analyze and act on that information, you will be better prepared for troubleshooting and incident resolution. Additionally, you may want to put into operation a don’t-repeat-incidents (DRI) process, which involves stopping new work on a service involved in an incident until you fix or mitigate the causes. This reinforces the commitment to resolving issues rather than accepting short-term fixes.
You can have a troubleshooting process without monitoring data. If that is the case, the response team is forced to diagnose and solve the problem with a heavy dose of guesswork.
If you monitor data in real time, you can give your team accurate information so they can formulate a theory about what is causing a problem and how to fix it, using facts rather than guesswork.
Carefully calibrate your alerting tools
With all the monitoring tools available today, it is possible to have too much information about your systems, which can make it difficult to develop a clear plan for how to use the data. This is where programmatic alerting becomes critical.
A practical first step is to set alerts in the form of thresholds for service level indicators (SLIs). These are simple metrics or thresholds you can track with automated monitoring tools, and which indicate when a serious problem might be happening or is about to happen.
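A threshold check of this kind can be sketched as follows. The SLI names, limits, and readings are hypothetical, and the code does not represent the API of any particular monitoring tool; it only shows the idea of comparing each reading against a configured limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliThreshold:
    """A limit for one service level indicator (hypothetical structure)."""

    name: str
    limit: float
    higher_is_worse: bool = True  # False for indicators such as availability


def breached(threshold: SliThreshold, value: float) -> bool:
    """True when the reading is on the wrong side of the limit."""
    if threshold.higher_is_worse:
        return value > threshold.limit
    return value < threshold.limit


def check_slis(thresholds: list[SliThreshold], readings: dict[str, float]) -> list[str]:
    """Return one alert message per breached threshold; missing readings are skipped."""
    alerts = []
    for threshold in thresholds:
        value = readings.get(threshold.name)
        if value is not None and breached(threshold, value):
            alerts.append(f"{threshold.name}={value} breached limit {threshold.limit}")
    return alerts


thresholds = [
    SliThreshold("error_rate", 0.01),
    SliThreshold("availability", 0.999, higher_is_worse=False),
]
alerts = check_slis(thresholds, {"error_rate": 0.03, "availability": 0.9995})
print(alerts)  # ['error_rate=0.03 breached limit 0.01']
```

Keeping the thresholds as data rather than hard-coded conditions makes it easier to recalibrate them as the team learns which alerts are useful.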
Each incident may require a technical lead and a communications lead, each of whom reports to the incident commander. The technical lead typically dictates the specific technical response to a given incident. In some cases, you may need more than one technical lead, depending on how many systems are impacted.
How to reduce the total time of corrective maintenance?
The total maintenance time runs from the minute the fault is detected to its resolution. The team should not lose time in reporting the fault to the technician in charge. If the repair itself takes too long, that can be an indicator that certain equipment needs to be replaced.
These are some strategies to reduce corrective maintenance time and allow your organization to operate more efficiently.
Strategies to improve accessibility should be applied during the design phase of a facility. Alternatively, the arrangement of assets can be modified later, once an opportunity to improve accessibility has been identified.
Arguably, the most challenging strategy for reducing corrective maintenance time is to improve workers’ skills so that they are better able to recognize, locate, and isolate faults. The solution is to train workers and give them the technical ability to perform maintenance activities well.
When replacing equipment parts, it helps to know which components can be physically or functionally interchanged without compromising the integrity of the asset. Interchangeable parts are particularly common among hardware items such as nozzles, hoses, fastening devices, and common valves. Improving interchangeability can reduce corrective maintenance time by making it easier to remove and replace similar components.
Account for human limitations
Corrective maintenance tasks are mainly performed by workers, so it is necessary to take into account human limitations when designing how assets and equipment are arranged. A big part of this strategy is to consider the ergonomics of the workplace.
How to reduce the number of repair actions?
If an operator notices that a particular machine or piece of equipment is under-performing, a service technician should be called immediately. Acting fast can prevent a total failure of the asset and a complete shutdown of production. In an emergency, the technical operations manager should be available to give a quick response and to review the protocol with the team.
Almost all equipment requires periodic servicing. In the case of heavy-duty machines, replacing some mechanical components regularly (often annually) is highly recommended. With a good CMMS, you can automate notifications for this work, which makes it easier to keep all maintenance up to date and reduces the number of repair actions.
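The logic behind such automated notifications can be sketched as a due-date check. This is not the interface of any specific CMMS: the function names, the 14-day notice window, and the asset data are assumptions made for the example.

```python
from datetime import date, timedelta


def next_due(last_service: date, interval_days: int) -> date:
    """Date on which the next periodic service falls due."""
    return last_service + timedelta(days=interval_days)


def due_assets(
    schedule: dict[str, tuple[date, int]],
    today: date,
    notice_days: int = 14,
) -> list[str]:
    """Assets whose next service is overdue or falls within notice_days of today."""
    cutoff = today + timedelta(days=notice_days)
    return sorted(
        asset
        for asset, (last_service, interval_days) in schedule.items()
        if next_due(last_service, interval_days) <= cutoff
    )


# Hypothetical schedule: asset -> (last service date, service interval in days).
schedule = {
    "crane-02": (date(2023, 6, 1), 365),
    "pump-07": (date(2024, 1, 10), 365),
}
to_notify = due_assets(schedule, today=date(2024, 5, 25))
print(to_notify)  # ['crane-02']
```

Running a check like this daily and sending the result to the responsible technician is one way to keep scheduled work from slipping.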
Keep in mind that MTTR is important, but it is not the only metric. To get good results in the long term, you should put in place a strategy that coordinates a continuous stream of real-time data, alert policies, and tools to support incident-management processes. This is the best formula for systematically and efficiently resolving incidents. It’s also the best way to continuously reduce MTTR.