As networks have grown more complex, communications service providers (CSPs) find it increasingly difficult to operate them efficiently. Network devices and external polling systems emit a continuous stream of alarms, inundating CSP network operations centers (NOCs) with hundreds of thousands of alerts each day. NOCs traditionally employ static rules and filters to reduce the number of alerts their staff must respond to. These static filters have three main drawbacks:
- The filters are difficult to manage since every time a new piece of equipment is added to the network, the rules must be reviewed and updated.
- Non-critical or minor-severity alerts are often discarded or filtered out, even though they can be leading indicators that a more serious event is developing.
- The alarm volume that remains after filtering is often still higher than a reasonably sized NOC staff can handle.
A better approach to alarm management is to apply machine learning (ML) to automatically identify which alerts are most likely to lead to a network incident and to highlight those alerts for the network operations staff. This lets NOC staff prioritize their workload around the issues most likely to affect customers. Through the ML techniques described below, some of the CSPs we’ve worked with have been able to reduce alarm volume by more than 90% while gaining visibility into customer-impacting issues developing in their networks.
ML for Real-time Incident Prediction
A strong use case for alarm prioritization using ML is alarm classification based on incident prediction. Here, historical alarm data, labeled with the incidents opened in the CSP’s trouble ticket system, is used to train an ML predictive model that identifies new alarms in a real-time stream that are likely to have tickets opened against them. Once the model is trained, reinforcement learning can be used to keep it up to date based on the actions NOC operators take on classified alarms, allowing the model to adapt over time as the network architecture evolves.
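The labeling step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Guavus implementation: the record fields (`device`, `raised_at`, `opened_at`), the four-hour matching window, and the rule of matching alarms to tickets by device are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def label_alarms(alarms, tickets, window=timedelta(hours=4)):
    """Label each alarm 1 if a ticket was opened on the same device
    within `window` after the alarm was raised, otherwise 0.
    The matching rule and the window are illustrative assumptions."""
    # Index ticket open times by device for quick lookup.
    opened_by_device = {}
    for ticket in tickets:
        opened_by_device.setdefault(ticket["device"], []).append(ticket["opened_at"])

    labeled = []
    for alarm in alarms:
        led_to_ticket = any(
            alarm["raised_at"] <= opened <= alarm["raised_at"] + window
            for opened in opened_by_device.get(alarm["device"], [])
        )
        labeled.append({**alarm, "label": int(led_to_ticket)})
    return labeled

# Hypothetical sample data.
alarms = [
    {"id": "a1", "device": "cmts-01", "raised_at": datetime(2018, 6, 1, 10, 0)},
    {"id": "a2", "device": "cmts-01", "raised_at": datetime(2018, 6, 1, 20, 0)},
    {"id": "a3", "device": "olt-07", "raised_at": datetime(2018, 6, 1, 11, 0)},
]
tickets = [
    {"id": "t1", "device": "cmts-01", "opened_at": datetime(2018, 6, 1, 12, 30)},
]

labeled = label_alarms(alarms, tickets)
print([(a["id"], a["label"]) for a in labeled])
```

The labeled records would then serve as training examples for whichever classifier is chosen; the article does not specify the model type.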
Those alarms classified as likely to have a ticket opened against them can be sub-classified further. In the initial use case, this sub-classification identifies alarms as caused either by an unplanned event or by a known maintenance ticket. In a blind test with a large cable operator, our Guavus analytics model correctly classified 81.5% of the alarms that would eventually have an unplanned trouble ticket associated with them, as well as 98.5% of the alarms that would not. The accuracy for associating alarms with known maintenance events was 92.7%.
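Read in standard classification terms, the first two figures above look like a sensitivity (the share of ticket-bound alarms the model caught) and a specificity (the share of non-ticket alarms it correctly left alone). That reading is our interpretation, and the data below is made up purely to show how the two rates are computed; it is not the blind-test data.

```python
def detection_rates(actual, predicted):
    """Return (sensitivity, specificity) for parallel lists of 0/1 values,
    where 1 means 'alarm led to (or is predicted to lead to) a ticket'."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    # Guard against empty classes so the function never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity

# Hypothetical outcomes for ten alarms.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sensitivity, specificity = detection_rates(actual, predicted)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```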
One benefit of this ML approach is that the confidence threshold for the classification can be set based on the business objectives of the operations organization. For alarms likely to lead to an incident, the threshold may be set relatively low to keep false-negative predictions to a minimum. While this means more alarms are classified as likely to lead to an incident, there’s still a substantial reduction in the number of alarms NOC staff must focus on, and important alarms are not dismissed prematurely.
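The trade-off can be made concrete with a small sketch. The scores and labels below are invented for illustration, and `summarize` is a hypothetical helper rather than part of any product; it simply counts how many alarms are flagged and how many ticket-bound alarms are missed at a given threshold.

```python
def summarize(scores, labels, threshold):
    """Count flagged alarms and missed incidents at a confidence threshold.
    `scores` are model confidences in [0, 1]; `labels` are 1 if the alarm
    actually led to a ticket, else 0."""
    flagged = [score >= threshold for score in scores]
    missed = sum(1 for f, y in zip(flagged, labels) if y == 1 and not f)
    return {
        "flagged": sum(flagged),
        "missed": missed,
        # Fraction of alarms the NOC no longer has to look at.
        "reduction": 1 - sum(flagged) / len(scores),
    }

# Hypothetical model output for eight alarms.
scores = [0.95, 0.80, 0.35, 0.30, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

strict = summarize(scores, labels, threshold=0.5)
lenient = summarize(scores, labels, threshold=0.25)
print("threshold 0.50:", strict)
print("threshold 0.25:", lenient)
```

In this toy data, lowering the threshold from 0.5 to 0.25 eliminates the one missed incident at the cost of two extra flagged alarms, while still cutting the alarm list in half.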
As networks become more complex and operational efficiency becomes even more of an issue, machine analytics is a natural fit for network operations organizations. Alarm prioritization is a logical first step for applying machine analytics in the NOC, since it can be incorporated into the event management system (EMS) with minimal changes to NOC systems and procedures. For example, with Guavus’ Alarm IQ, the classification of alarms is fed back into the existing EMS as updates to the alarms, and the NOC system manager can choose which fields to update based on that operator’s specific EMS implementation and schema. The updates are reflected on the NOC operator’s existing events list, allowing implementation with minimal staff disruption and training. The EMS simply gets smarter about which alarms are important to the NOC staff.
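As a rough illustration of that feedback step, the sketch below builds an alarm update from a classification result using an operator-defined field mapping. Everything here is hypothetical: the function, the payload shape, and field names such as `CustomText1` are placeholders for whatever a given EMS schema provides, and none of it represents the actual Alarm IQ interface.

```python
def build_alarm_update(alarm_id, classification, confidence, field_map):
    """Translate a classification result into an EMS alarm update.
    `field_map` maps result keys to the EMS fields the operator has
    chosen to update; result keys left out of the map are skipped."""
    values = {
        "classification": classification,
        "confidence": round(confidence, 3),
    }
    return {
        "alarm_id": alarm_id,
        "fields": {field_map[key]: value
                   for key, value in values.items() if key in field_map},
    }

# Hypothetical mapping onto spare fields in an operator's EMS schema.
field_map = {"classification": "CustomText1", "confidence": "CustomNum1"}

update = build_alarm_update("a1", "unplanned_incident", 0.8734, field_map)
print(update)
```

Keeping the mapping as configuration rather than code is what would let the same classifier output feed differently structured EMS deployments.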
Find out how Guavus Alarm IQ, advanced analytics software, can help your organization prioritize network alarms based on customer impact.
As originally published September 1, 2018 in Broadband Library