Guavus Blog

All the latest from the world of Guavus

Advanced Analytics for Improved System Availability

by Chris Neisinger, Field CTO, Guavus, Inc.

The mobile industry is in transition as new network architectures such as NFV and radio access technologies, including multi-RAT, C-RAN and 5G, introduce complex operational requirements. Mobile operators are looking for simplified, cost-efficient operational models. Advanced analytics solutions that use machine intelligence and AI enable intelligent automation and, at the same time, improve system availability.

System availability is improved through (1) elimination or minimization of the number of outages, and (2) reduction of mean time to detect (MTTD) and mean time to repair (MTTR). The first method is achieved through effective preventative maintenance and operational excellence. Great ops teams focus on and deliver outage prevention. However, outages and incidents still happen. The second method, the quick detection, diagnosis and recovery from incidents, is best achieved via advanced analytics.
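
As a rough illustration of why both levers matter, the sketch below computes availability over an observation period from an outage count and the per-outage detection and repair times. The function name, the assumption that each outage costs MTTD plus MTTR of downtime (with MTTR counted from the moment of detection), and the sample numbers are all illustrative and are not figures from Guavus.

```python
def availability(period_hours, outages, mttd_hours, mttr_hours):
    """Fraction of the observation period during which the system was up.

    Assumption for illustration: every outage costs mttd + mttr hours of
    downtime, with MTTR counted from the moment of detection.
    """
    downtime = outages * (mttd_hours + mttr_hours)
    if downtime > period_hours:
        raise ValueError("downtime exceeds the observation period")
    return (period_hours - downtime) / period_hours


# Hypothetical year of operation (8760 hours).
baseline = availability(8760, 10, mttd_hours=1.0, mttr_hours=3.0)
# Method (1): half as many outages, same detection and repair times.
fewer = availability(8760, 5, mttd_hours=1.0, mttr_hours=3.0)
# Method (2): same outage count, faster detection and repair.
faster = availability(8760, 10, mttd_hours=0.25, mttr_hours=1.75)
```

Under these made-up numbers, halving the outage count and halving the time spent per outage recover the same amount of uptime, which is why the two methods are complementary.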

Improved Mean Time to Detect (MTTD)

Analytics-based operations significantly improve MTTD through the use of machine-learning-based anomaly detection. Because it learns what normal looks like from the data itself, this method is superior to traditional alarm threshold methods based on static key performance indicators (KPIs).

Using machine learning, millions of network, telemetry and customer events per day are ingested and correlated from multiple disparate sources such as network alarms, trouble tickets, and network performance data.

Advanced analytics systems use algorithms to analyze the millions of time series based on collected events and contextualize these events with related attributes.  This allows the operator to automatically establish and maintain computed patterns and baselines without having to manually set thresholds and KPIs.

Using advanced machine learning, the system automatically identifies deviations from the baselines that lead to incidents, routinely uncovering what standard threshold alerting systems fail to detect. And since the thresholds are not manually set, the baselines automatically adjust as the datasets evolve. This frees the operations staff from continuously reprogramming thresholds as network conditions change and allows them to focus on actually resolving the anomalies detected.
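
The idea of a baseline that maintains itself can be sketched with a simple rolling z-score, shown below. This is a deliberately minimal stand-in and not the algorithm Guavus uses; the function name, the window size, the z-score limit and the sample series are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_anomalies(series, window=12, z_limit=3.0):
    """Return the indices whose value deviates from the rolling baseline.

    The baseline is the mean and standard deviation of the previous
    `window` points, so it follows the data as it evolves.
    """
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(series):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip perfectly flat windows, where a z-score is undefined.
            if sigma > 0 and abs(value - mu) / sigma > z_limit:
                flagged.append(i)
        history.append(value)
    return flagged


# Hypothetical KPI that trends upward, with one sudden dip at index 20.
series = [100 + i for i in range(30)]
series[20] = 60

flagged = rolling_anomalies(series)
# A hypothetical static alarm floor of 50, tuned for the early traffic level.
static_alarms = [i for i, v in enumerate(series) if v < 50]
```

In this example the rolling baseline flags the dip, while the static floor never fires because the dip still sits above a threshold that was set when traffic was lower.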

Reduce Mean Time to Understand (MTTU) and Determine the Root Issue

Once the incidents have been detected, the system must contextualize and group the incidents by commonalities.  This enables the operations staff to quickly understand the current outage conditions and begin the repair process.
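
One simple way to picture grouping by commonalities is to bucket events that share an attribute and occur close together in time. The sketch below does only that; the event fields (`ts`, `element`, `type`), the function name and the five-minute window are hypothetical, and a production system would correlate on many more attributes.

```python
from collections import defaultdict


def group_by_commonality(events, attr="element", window_s=300):
    """Group events that share `attr` and arrive within `window_s` seconds
    of the previous event in the same group."""
    by_attr = defaultdict(list)
    for ev in sorted(events, key=lambda e: e["ts"]):
        by_attr[ev[attr]].append(ev)

    groups = []
    for evs in by_attr.values():
        current = [evs[0]]
        for ev in evs[1:]:
            if ev["ts"] - current[-1]["ts"] <= window_s:
                current.append(ev)
            else:
                # Gap too large: close this group and start a new one.
                groups.append(current)
                current = [ev]
        groups.append(current)
    return groups


# Hypothetical events; timestamps are seconds since an arbitrary start.
events = [
    {"ts": 0, "element": "router-7", "type": "link_down"},
    {"ts": 40, "element": "router-7", "type": "bgp_flap"},
    {"ts": 55, "element": "cell-12", "type": "high_latency"},
    {"ts": 90, "element": "router-7", "type": "packet_loss"},
    {"ts": 4000, "element": "router-7", "type": "link_down"},
]
groups = group_by_commonality(events)
```

Here the three early events on the same element collapse into one incident, so the operations staff see one problem to investigate instead of three separate alarms.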

Individual “events” are often the result of a separate event that spawns a series or combination of faults.  A shotgun troubleshooting approach rarely addresses the root issue on the first round of troubleshooting.  Systems need to start with all of the raw data, then transform and group events by commonalities.  Time-dependent Bayesian network analysis and probabilistic directed graphs indicate families of incidents where the parent point on the graph is the likely root issue.
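
The "parent point on the graph" idea can be sketched as follows. This is not Bayesian network inference; it assumes some upstream model has already produced, for each pair of events, a probability that one spawned the other, and it simply keeps the confident edges and reports the nodes that have no parent. The function name, the 0.5 cutoff and the edge probabilities are hypothetical, and the confident edges are assumed to form an acyclic graph.

```python
def likely_roots(edges, min_prob=0.5):
    """Map each likely root event to the events that descend from it.

    edges: {(parent, child): probability that parent spawned child}.
    Only edges with probability >= min_prob are treated as causal links.
    """
    children = {}
    nodes = set()
    has_parent = set()
    for (parent, child), prob in edges.items():
        nodes.update((parent, child))
        if prob >= min_prob:
            children.setdefault(parent, []).append(child)
            has_parent.add(child)

    result = {}
    for root in sorted(nodes - has_parent):
        # Depth-first walk to collect everything downstream of this root.
        seen, stack = set(), [root]
        while stack:
            node = stack.pop()
            for child in children.get(node, []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        result[root] = sorted(seen)
    return result


# Hypothetical edge probabilities for one cluster of events.
edges = {
    ("link_down", "bgp_flap"): 0.9,
    ("bgp_flap", "packet_loss"): 0.8,
    ("link_down", "packet_loss"): 0.4,
    ("fan_alarm", "packet_loss"): 0.1,
}
roots = likely_roots(edges)
```

In this example `link_down` is reported as the parent of the whole family, while `fan_alarm` stands alone with no descendants, which suggests it is unrelated to the cluster.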

Using these advanced techniques and behavior analysis, root incidents can be predicted in real time, significantly reducing the time to diagnose the outage.

This provides the operations team not only with early detection, but with the network element commonalities and a determination of the most likely root issue, identifying which event has spawned the cluster of events.  The operations personnel can now quickly focus on the root issue.  Subtending faults often self-clear once the root issue is resolved, making identification and resolution of the “root issue” the fastest way to restore service.

Faster Detection + Root Incident Prediction = Better System Availability

The sooner an event is detected, the sooner the operations teams can troubleshoot and restore service. Machine learning and AI-based systems provide the information needed to predict the root incident and allow the operations team to focus on understanding the failure. Once the root issue is understood, repair and restoration can begin.

The goal of the operations team is simple: provide maximum uptime and excellent service. “Don’t let it fail, but when it does… detect early; diagnose immediately; get to work and restore service.”

Guavus enables network operations teams to be successful in today’s dynamic, fast-paced world. Leveraging the machine learning and AI found in our advanced analytics, we automatically detect anomalies in operational data, no matter what the type. Taking in and processing billions of events per day, we contextualize these events with related attributes, and automatically detect deviations. Then we dive in further to establish the root issue of each deviation so that operators can identify true issues faster and get services restored as quickly as possible. In this way, Guavus uses advanced analytics to support operators in their quest for operational excellence.
