Root issue analysis
Imagine this. You work in the Network Operations Center (NOC) as a senior manager. You drive in to work, sit down at your desk and log in. By the time the system authenticates you and you’re up and running, you have 5,000 alarms staring you in the face. Of those, 700 are listed as critical. Not a good start. You’re going to need a strong coffee. You get yourself a cup and when you get back, you have 10,000 alarms, with 1,500 listed as critical. Today’s going to be a long day. Actually, every day is like this…
NOC teams are flooded with thousands of network alarms at any given time, usually over 80,000 alarms per day. At that volume, manually combing through alarms is an impossible task. There are tools in place to manage the alarms and classify them according to severity, but even with those, approximately 10-20% of alarms are listed as critical, which is far too many to handle. However, many alarms don’t actually lead to incidents. How do you know which ones to ignore and which ones need your attention? Which ones are really just different versions of the same alarm? Which ones are ‘sympathetic alarms’, telling you that a node’s ‘neighbors’ are down rather than the node itself?
Since networks are made up of interconnected components, problems in one component will cause problems in another. The more time it takes to identify and fix the problems in the network, the greater the impact. But which component is to blame?
Discovering the hidden dependency structure
Before we can figure this out, we need to group alarms by issue type or related issues in order to reduce the number of alarms that must be dealt with. We also need to understand the relationship between the alarms, such as which one is the parent alarm and which one is the associated child alarm, and understand the extended family structure: one parent can have multiple children, a child can also be a parent, and so on. Once the parent, or root, alarms are clearly identified, alarm severity and criticality can be properly assessed and addressed. Rather than wasting time fixing the child (dependent) alarms, we can focus on the parent alarms that are the true issues. Fix the main problem and the rest will take care of itself.
Applying machine learning Bayesian network analysis
Each network alarm is characterized by a unique identifier or a set of features that capture behavioral attributes such as severity of impact, location attributes (e.g. network topology) and device attributes (e.g. device ID, model, make, type, firmware). Typically, technicians diagnose a problem in the network by looking beyond an alarm instance at the corresponding issue and the connections across issues, based on their knowledge of the network and their experience. However, this is not scalable: as the number of alarms and corresponding issues continues to grow, tracking them by hand becomes humanly impossible. The ability to quickly diagnose the problem also varies tremendously depending on the level and experience of each technician.
This is where machine learning, particularly unsupervised machine learning, can help. First, we reduce the total number of alarms by grouping together the alarms that relate to the same issue. Each group of alarms related to the same issue is classified as a ‘tuple’.
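As a minimal sketch of this grouping step, the Python below collapses alarm instances into tuples by keying each instance on a few of its features. The field names (severity, device_type, alarm_type) and the exact-match key are assumptions made for illustration; the article does not say which features or which unsupervised algorithm were actually used.

```python
from collections import Counter

# Hypothetical alarm instances; the field names are illustrative assumptions.
alarms = [
    {"id": 1, "severity": "critical", "device_type": "router", "alarm_type": "link_down"},
    {"id": 2, "severity": "critical", "device_type": "router", "alarm_type": "link_down"},
    {"id": 3, "severity": "major", "device_type": "switch", "alarm_type": "high_cpu"},
    {"id": 4, "severity": "critical", "device_type": "router", "alarm_type": "link_down"},
]

# Assumed set of features that identify an issue.
KEY_FIELDS = ("severity", "device_type", "alarm_type")


def to_tuple(alarm):
    """Reduce one alarm instance to the feature tuple that identifies its issue."""
    return tuple(alarm[field] for field in KEY_FIELDS)


def group_into_tuples(alarms):
    """Count how many alarm instances fall under each unique tuple."""
    return Counter(to_tuple(alarm) for alarm in alarms)


tuple_counts = group_into_tuples(alarms)
for key, count in tuple_counts.items():
    print(key, count)
```

Here four alarm instances collapse into two tuples. A real deployment would more likely cluster on similarity than on exact matches, but the effect is the same: many instances, far fewer issues.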
Next, we must find the relationship across the issues (tuples). Bayesian network analysis provides a useful framework to determine the relationship between these issues, using observed alarm data. Who is the parent? Who is the child? How many children does the parent have? Do they have siblings? We also want to determine the strength of the relationship between these issues. How likely is one issue to cause another?
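One simple way to put a number on "how likely is one issue to cause another" is to estimate a conditional probability from observed data. The sketch below assumes alarms have already been bucketed into time windows, each represented as the set of issues seen in it. The window contents and issue names are made up, and this relative-frequency estimate is only a building block, not the full structure-learning procedure.

```python
def conditional_probability(windows, child, parent):
    """Estimate P(child present | parent present) as a relative frequency.

    windows: iterable of sets, each holding the issues observed in one time window.
    Returns None when the parent never occurs, because the estimate is undefined.
    """
    parent_windows = [w for w in windows if parent in w]
    if not parent_windows:
        return None
    both = sum(1 for w in parent_windows if child in w)
    return both / len(parent_windows)


# Hypothetical observation windows (issue names are illustrative).
windows = [
    {"power_failure", "link_down"},
    {"power_failure", "link_down", "high_latency"},
    {"link_down"},
    {"power_failure"},
    {"high_latency"},
]

p = conditional_probability(windows, child="link_down", parent="power_failure")
print(p)
```

Co-occurrence alone does not tell you which issue is the parent. Deciding the direction of each edge is the job of the Bayesian network structure, and estimates like this one fill in its conditional probability tables.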
The structure of the Bayesian network graph helps us identify the connections between all issues. It identifies which issues are the root issues and which issues are the children. We also quantify the strength of the connections (relationships) between the issues using the conditional probability tables associated with each issue. Based on the strength of the relationships, the probability of issues occurring can be determined, alarms can be classified as critical or not, and the NOC team’s focus can be prioritized.
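Once a structure has been learned, picking out the root issues is a matter of finding nodes with no parents. The sketch below assumes the graph is already available as a parent-to-children mapping; the issue names and edges are hypothetical, and ranking roots by their number of descendants is one possible prioritization, not necessarily the one used in the article.

```python
# Hypothetical learned structure: parent issue -> list of child issues.
edges = {
    "power_failure": ["link_down", "fan_alarm"],
    "link_down": ["high_latency", "bgp_session_drop"],
    "config_error": ["auth_failure"],
}


def find_roots(edges):
    """Return issues that appear as a parent but never as a child, sorted by name."""
    children = {child for kids in edges.values() for child in kids}
    return sorted(parent for parent in edges if parent not in children)


def descendants(edges, node):
    """Collect every issue reachable from node by following parent -> child edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


roots = find_roots(edges)
# Number of downstream issues each root can explain.
impact = {root: len(descendants(edges, root)) for root in roots}
print(roots, impact)
```

In this toy graph, link_down is both a child and a parent, so it is not reported as a root. The two roots are power_failure and config_error, and power_failure explains more downstream issues.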
As an example, Guavus took a sample of 104,797 alarm instances and applied machine learning algorithms to the data set. The machine reduced the 104,797 alarm instances to 315 unique alarm tuples.
From there, we applied Bayesian network algorithms to group these tuples into families. We analyzed these families and found that of the 315 unique tuples, 169 could be reduced to 30 multi-tuple alarm families. (The remaining 146 tuples were singletons, each forming a family of just one tuple.)
The results were dramatic: 104,797 alarm instances were reduced to 176 issues (30 alarm families and 146 singletons), a much more manageable set.
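The counting behind these figures can be sketched as follows. The code treats a family as a connected group of tuples in the relationship graph, which is an assumption about how families were formed; the toy tuples and edges are hypothetical, and only the totals 315, 169 and 30 come from the article.

```python
def count_issues(tuples, edges):
    """Group tuples into families and count them.

    tuples: iterable of tuple identifiers.
    edges: iterable of (parent, child) pairs between tuples.
    Returns (multi_tuple_families, singletons).
    """
    leader = {t: t for t in tuples}

    def find(x):
        # Follow links to the family representative, shortening the path on the way.
        while leader[x] != x:
            leader[x] = leader[leader[x]]
            x = leader[x]
        return x

    for a, b in edges:
        leader[find(a)] = find(b)

    sizes = {}
    for t in leader:
        rep = find(t)
        sizes[rep] = sizes.get(rep, 0) + 1

    families = sum(1 for size in sizes.values() if size > 1)
    singletons = sum(1 for size in sizes.values() if size == 1)
    return families, singletons


# Toy data: six tuples forming two families (A-B-C and D-E) and one singleton (F).
families, singletons = count_issues("ABCDEF", [("A", "B"), ("A", "C"), ("D", "E")])

# Figures reported in the article.
total_tuples, grouped_tuples, reported_families = 315, 169, 30
reported_singletons = total_tuples - grouped_tuples
reported_issues = reported_families + reported_singletons
print(families, singletons, reported_singletons, reported_issues)
```

The arithmetic matches the reported result: 315 tuples minus the 169 that belong to multi-tuple families leaves 146 singletons, and 30 families plus 146 singletons gives 176 issues.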
The business value
As the example above shows, by applying root issue analysis Guavus was able to drastically reduce the number of network alarms, reveal the hidden structure of issues and clearly identify the strength of the relationships between the issues. All of this information allows NOC teams to rapidly identify the alarms that represent the true root issue and not simply secondary symptoms. Faster identification of real problems leads to faster resolution. NOC teams will be able to focus only on truly critical alarms and have the visibility they need to take the right course of action, fast.
Explanation of Directed Acyclic Graph: A graph consists of nodes and edges, where each edge connects two nodes.
Directed graph: The connection between two nodes (a parent and a child) represents a relationship with a direction.
Acyclic graph: There are no cycles or loops in the graph.
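To make the definitions above concrete, the sketch below checks whether a small directed graph is acyclic using Python's standard graphlib module (Python 3.9 or later). The graphs and issue names are hypothetical; the mapping goes from each node to the set of its parents, which is the input format graphlib expects.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(parents_of):
    """Return True if the directed graph has no cycles.

    parents_of maps each node to the set of its parent nodes.
    """
    try:
        TopologicalSorter(parents_of).prepare()
        return True
    except CycleError:
        return False


# A valid DAG: power_failure -> link_down -> high_latency.
dag = {"link_down": {"power_failure"}, "high_latency": {"link_down"}}

# Not a DAG: a and b each list the other as a parent.
cyclic = {"a": {"b"}, "b": {"a"}}

# In a DAG, parents can always be listed before their children.
order = list(TopologicalSorter(dag).static_order())
print(is_acyclic(dag), is_acyclic(cyclic), order)
```

The ordering is what makes a directed acyclic graph useful for root issue analysis: because there are no loops, every chain of parent and child issues has a definite starting point.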
Image attribution: bigstockphoto.com