Root Cause Analysis in Troubleshooting

“For every effect there is a root cause. Find and address the root cause rather than try to fix the effect, as there is no end to the latter.”

Earlier articles looked at the broad principles of troubleshooting. Root cause analysis is covered separately because, without knowing the root cause of a problem, resolving it is either impossible or at best a matter of trial and error. Finding the root cause of a total failure is usually easier than analyzing an intermittent one, because the conditions that trigger an intermittent failure may occur only once or twice. Even experienced troubleshooters often fail to find the cause of intermittent problems, and generic problem-solving methods tend to be ineffective, so these problems get overlooked. They still need to be analyzed, however, to prevent them from growing into complete failures and a shutdown of the system or equipment. This is where root cause analysis becomes vital.

This process is specially designed to ‘root out’ these ‘intermittent irritants.’ Root cause analysis in troubleshooting follows a process that is defined by:

– Understanding the significant event that caused the failure
– Defining the actual problem and collecting data and information relating to the event
– Analyzing the tasks that were done just before the failure occurred
– Analyzing the changes that have occurred
– Eliminating possible causes to narrow the field down to one or two
– Determining the actual root cause or causes
– Listing and suggesting corrective actions
– Documenting the steps and reporting on the final solution
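
The elimination step in the list above can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed method: the candidate causes, the observations and the `eliminate` helper are all hypothetical names invented for the example.

```python
# Hypothetical sketch: narrow a list of candidate causes by discarding
# any candidate whose predictions contradict what was actually observed.

def eliminate(candidates, observations):
    """Return the candidate causes consistent with every observation.

    candidates:   dict mapping a cause name to the facts it predicts
    observations: dict mapping a fact name to what was actually observed
    """
    remaining = []
    for cause, predicted in candidates.items():
        # A fact that was never observed cannot rule a cause out,
        # so a missing observation defaults to the predicted value.
        consistent = all(
            observations.get(fact, expected) == expected
            for fact, expected in predicted.items()
        )
        if consistent:
            remaining.append(cause)
    return remaining


# Illustrative data only (assumed, not taken from a real incident).
candidates = {
    "faulty power supply": {"fails_under_load": True, "fails_when_idle": True},
    "overheating": {"fails_under_load": True, "fails_when_idle": False},
    "recent driver update": {"started_after_change": True},
}
observations = {
    "fails_under_load": True,
    "fails_when_idle": False,
    "started_after_change": False,
}

print(eliminate(candidates, observations))
```

Each new observation collected from the user or the equipment can be added to `observations` and the function run again, which mirrors how the list of possible causes shrinks as data comes in.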

Analyzing the tasks undertaken just before the failure shows how they were actually done as opposed to how they should have been done. This establishes what actually happened. The preliminary analysis then serves as a base for understanding the changes that resulted from the tasks being done incorrectly rather than correctly. The troubleshooter needs to ask accurate, probing questions when speaking with the user.

The troubleshooting staff also need to analyze why the controls of the system or equipment, that is, the barriers that are meant to prevent breakdowns, did not work. This analysis determines whether these barriers are working incorrectly, are absent, or need to be revamped. Root cause analysis is a critical step in troubleshooting, and each process within it needs to be clearly defined and understood before the actual root cause can be established.

After identifying the events, tasks and control barrier failures, it is vital to prepare a flow chart that brings all of them together. Having all the information in one flow chart makes it easier to identify the cause that triggered the failure, which in turn makes it simpler to ascertain the root cause. Apart from the flow chart, the troubleshooter must be able to elicit answers to causal questions from the user of the system or equipment. The answers help to determine whether the failure was caused by human error, machine failure, or both. Users are often reluctant to provide accurate information, especially if they think or know that incorrect usage or commands probably led to the failure. These questions and answers need to be documented accurately too, as they play an important role in substantiating the troubleshooter’s conclusions and findings, and they help to find the root cause and conclude the root cause analysis step.
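
As a rough illustration of bringing events, tasks and barrier failures together, the sketch below merges them into a single ordered timeline, which is the raw material for a flow chart. The `Entry` type, the helper functions and the sample data are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    minute: int  # minutes relative to the failure (negative = before)
    kind: str    # "event", "task" or "barrier"
    note: str


def build_timeline(*groups):
    """Merge several lists of entries and order them by time."""
    merged = [entry for group in groups for entry in group]
    return sorted(merged, key=lambda entry: entry.minute)


def render(timeline):
    """Render the timeline as a one-line text flow chart."""
    return " -> ".join(f"[{e.kind}] {e.note}" for e in timeline)


# Illustrative data only (assumed for the example).
events = [Entry(0, "event", "pump stops")]
tasks = [Entry(-10, "task", "operator changes setpoint")]
barriers = [Entry(-2, "barrier", "pressure alarm does not trigger")]

print(render(build_timeline(events, tasks, barriers)))
```

Reading the rendered chain from left to right shows which task and which failed barrier preceded the failure, which is the question the flow chart is meant to answer.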

The toughest part is conclusively determining ‘the’ root cause. An incorrectly determined root cause leads to a flawed solution that will probably fix a symptom but not the problem. An incorrect solution could also create further problems or aggravate the problem at hand. After going through the arduous steps of troubleshooting, coming up with a solution that does not work, or worse still compounds the problem, is frustrating and causes even more damage in lost time, money, effort and possibly revenue. The troubleshooter must be certain that the cause being tackled is actually the root cause, or else must fall back on previous documentation and consult other troubleshooters.

Once all concerned are satisfied that the root cause has been isolated, taking the necessary corrective actions is simpler. The troubleshooting staff will be more confident in implementing the solution if all the data and information collected has been checked and re-checked for accuracy. If possible, the solution should be implemented in a testing environment before it goes live, to confirm that it will work.

Documentation at every step of troubleshooting is imperative. In root cause analysis too, the final report must be detailed and provide accurate statements and recommendations. This step of troubleshooting is vital for determining failures of processes and technology as well as human failures.

Root cause analysis also leads to an understanding of the constraints within the work environment. Understanding the constraints shows where the bottlenecks or stumbling blocks to a solution lie. Removing or reducing these bottlenecks increases the capacity of the whole system, leading to fewer breakdowns and intermittent failures. Knowing the constraints helps pinpoint which process, component, or combination of the two is the biggest bottleneck so that it can be removed. For example, if a household has the refrigerator, TV and air conditioner running at the same time and the washing machine is then turned on as well, the washing machine may show an error signal, especially if the supply voltage is already low. The bottleneck is that the other appliances are drawing most of the available power, so the washing machine does not get what it needs. There is nothing wrong with the machine, but the system as a whole has a bottleneck that prevents it from working at all or makes it stop intermittently. Bottlenecks have a way of showing up, and an experienced troubleshooter will be able to identify them.
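
The household example can be made concrete with a little arithmetic. The sketch below treats the supply as a fixed capacity shared by all appliances; the wattage figures and the circuit capacity are hypothetical values chosen for illustration, and real appliances and circuits vary widely.

```python
# Hypothetical ratings in watts; none of these figures come from a datasheet.
CIRCUIT_CAPACITY_W = 3500

running = {"refrigerator": 300, "tv": 150, "air_conditioner": 2000}
washing_machine_w = 1200


def headroom(capacity_w, loads):
    """Capacity left on the circuit after the running loads are subtracted."""
    return capacity_w - sum(loads.values())


def can_start(capacity_w, loads, new_load_w):
    """True if the new appliance fits within the remaining headroom."""
    return new_load_w <= headroom(capacity_w, loads)


print(headroom(CIRCUIT_CAPACITY_W, running))
print(can_start(CIRCUIT_CAPACITY_W, running, washing_machine_w))

# Removing the largest load relieves the bottleneck.
without_ac = {k: v for k, v in running.items() if k != "air_conditioner"}
print(can_start(CIRCUIT_CAPACITY_W, without_ac, washing_machine_w))
```

With these assumed numbers the washing machine cannot start while the air conditioner is running but can once it is switched off, which shows that the fault lies in the shared capacity rather than in the machine.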

These bottlenecks sometimes need to be discovered. For example, if a user complains that the system is slow, the troubleshooter needs to check whether the problem lies with that one PC or with the whole system. By running a few checks, the troubleshooter can determine whether the bottleneck is in the individual machine or elsewhere. Failing to discover this constraint could lead over time to a major failure or a system crash. Finding the bottlenecks is not the only benefit: this analysis also helps to find alternative ways of removing them.
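
One way to check whether slowness is local or system-wide is to compare response times measured on several machines against a known-good baseline. The sketch below assumes such measurements are available; the `locate_bottleneck` function, the `slow_factor` threshold and the sample timings are hypothetical.

```python
def locate_bottleneck(response_ms, baseline_ms, slow_factor=2.0):
    """Classify slowness as local to some machines or shared by all.

    response_ms: dict mapping a machine name to a measured response time (ms)
    baseline_ms: response time considered normal (assumed to be known)
    slow_factor: a machine counts as slow above slow_factor * baseline_ms
    """
    slow = sorted(
        name for name, ms in response_ms.items()
        if ms > slow_factor * baseline_ms
    )
    if not slow:
        return ("none", [])
    if len(slow) == len(response_ms):
        # Every machine is slow, so the constraint is probably shared
        # (for example the network or a server), not one PC.
        return ("shared", slow)
    return ("individual", slow)


# Illustrative measurements only.
print(locate_bottleneck({"pc1": 120, "pc2": 110, "pc3": 900}, baseline_ms=100))
print(locate_bottleneck({"pc1": 800, "pc2": 750, "pc3": 900}, baseline_ms=100))
```

The classification is deliberately coarse. It only tells the troubleshooter where to look next, not what the root cause is.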

When system or equipment breakdowns happen in an organization, there is always more than one way to approach the solution. Each approach has resource utilization and expenditure attached to it, so the choice of approach becomes a leadership decision. Sometimes a quick and less expensive fix is chosen because of the time constraints the organization faces. Despite a detailed root cause analysis, the tendency in such cases is to treat the symptom, and this can prove detrimental in the long run. Not treating the underlying cause is bound to make the problem recur and possibly bring with it a new set of symptoms, problems and additional expenses. After the root cause has been determined and the best possible solutions recommended, it becomes a matter of policy and discretion for the decision makers to judge whether it costs more to attack the root cause or to keep removing the symptoms. The cost of each option is also tied to the unquantified costs of employee and customer dissatisfaction. It is prudent to eliminate the root cause if these unquantified, long-term costs are likely to add up to more than the cost of dealing with the root cause.
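
The trade-off between fixing the root cause once and treating symptoms repeatedly can be sketched as a break-even calculation. The function below and all cost figures are hypothetical; in practice the hidden cost per incident (lost time, employee and customer dissatisfaction) is hard to estimate and has to be supplied as an assumption.

```python
import math


def break_even_incidents(root_fix_cost, symptom_fix_cost,
                         hidden_cost_per_incident=0.0):
    """Number of recurrences at which treating the symptom has cost
    at least as much as fixing the root cause once."""
    per_incident = symptom_fix_cost + hidden_cost_per_incident
    if per_incident <= 0:
        raise ValueError("cost per incident must be positive")
    return math.ceil(root_fix_cost / per_incident)


# Illustrative figures only.
print(break_even_incidents(10_000, 800))
print(break_even_incidents(10_000, 800, hidden_cost_per_incident=450))
```

Including even a rough estimate of the hidden costs lowers the break-even point, which is the argument for dealing with the root cause when the problem is expected to recur.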
