Today, some of the main challenges in NOC management, described in the following diagram, are:
Troubleshooting billions of service alarms.
Processing around 20 million workflow management notifications by NOC experts.
Managing millions of call center emails.
Higher costs due to low use of workflow management.
Incident management is an area where we already use expert system frameworks. However, the constantly evolving nature of networks, both in their technology and in their deployment, makes it very difficult to maintain handwritten rules in expert systems. Data-driven, domain-independent automated incident management, with no need for handwritten rules, would greatly enhance NOC automation. For example, a failure on one node can cause cascading failures on other nodes, resulting in a series of alarms. Machine learning techniques allow us to discover co-occurring patterns in a sequence of alarms and other events, so that we can quickly identify the root cause in most failure scenarios. This frees up the NOC team to focus on more complex challenges.
What kind of complexity does this imply?
Typical handling of NOC alarms involves mapping the received alarms to incidents using enrichment, aggregation, deduplication, and correlation techniques. This is challenging because alarm information is heterogeneous: current telecommunication networks combine solutions from many technologies and many vendors. This heterogeneity makes it difficult to create a harmonized view of the system and greatly increases the complexity of fault detection and resolution.
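As a rough illustration of the enrichment and deduplication steps, the sketch below normalizes vendor-specific alarm names and then drops repeats of the same alarm on the same node within a short time window. The `Alarm` fields, the `VENDOR_ALIASES` table, and the 60-second window are assumptions made for this example, not details of any particular NOC product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Alarm:
    node: str         # hypothetical: network element that raised the alarm
    alarm_type: str   # hypothetical: alarm name as reported by the vendor
    timestamp: float  # seconds since an arbitrary reference point


# Hypothetical mapping from vendor-specific names to one harmonized name.
VENDOR_ALIASES = {"LOS": "link_down", "LinkDown": "link_down"}


def normalize(alarm):
    """Return a copy of the alarm with a harmonized alarm type."""
    harmonized = VENDOR_ALIASES.get(alarm.alarm_type, alarm.alarm_type)
    return replace(alarm, alarm_type=harmonized)


def deduplicate(alarms, window=60.0):
    """Keep an alarm only if the same (node, alarm_type) pair was not
    already kept within the previous `window` seconds."""
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.node, alarm.alarm_type)
        previous = last_kept.get(key)
        if previous is None or alarm.timestamp - previous > window:
            kept.append(alarm)
            last_kept[key] = alarm.timestamp
    return kept
```

Normalizing before deduplicating matters here: two vendors reporting the same fault under different names only collapse into one alarm once they share a harmonized type.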
Can we afford to encode domain knowledge long-term?
Current NOC solutions rely on rule-based management of alarms from different sources, such as nodes, service management systems, or element/network management systems. The rules are written to convert domain-specific information into an overall view of the network in the NOC, and they also encode the practices used to handle and correlate alarms for proper grouping.
Developing these rules takes considerable time and effort. Continuous changes in the network, with new types of network nodes and the new types of alarms they produce, further complicate the development and maintenance of rules. Furthermore, rules must be created and updated frequently; otherwise, the rule base becomes incomplete or even inaccurate.
Does this mean that we will stop developing domain-oriented rules?
This does not mean that traditional rule development will disappear; rather, domain-independent, data-driven approaches will augment it. In addition, automatic detection of possible correlations between alarms can enhance the rule-based approach when rules are incomplete or when domain-specific knowledge has not yet been acquired.
The data-driven approach will help identify cross-domain correlations and generate data-driven insights. Step by step, the system can evolve towards a fully automated solution.
Data-driven NOC automation
We will share a case study on automated incident formation, root cause identification, and self-healing scenarios that we are working on as part of our research.
We apply the principles of machine intelligence (data mining and data science) to discover behavior patterns in large historical data sets. These patterns essentially capture correlations between alarms and their co-occurrence. An exciting aspect of our approach is that we not only treat the data as time series, but also examine how to handle the categorical and largely symbolic information collected online and identify latent behaviors.
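One simple way to picture co-occurrence mining on historical alarms is to bucket events into fixed time windows and score how often two alarm types appear in the same window. The sketch below uses a Jaccard score for that purpose; the event format, the 300-second window, and the choice of Jaccard are assumptions for illustration and are not the specific method used in the case study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(events, window=300.0):
    """events: iterable of (timestamp, alarm_type) pairs.

    Returns {(type_a, type_b): jaccard}, where jaccard is the number of
    windows containing both types divided by the number of windows
    containing at least one of them.
    """
    # Each fixed-size window becomes the set of alarm types seen in it.
    buckets = {}
    for timestamp, alarm_type in events:
        buckets.setdefault(int(timestamp // window), set()).add(alarm_type)

    single = Counter()  # windows in which each type appears
    pair = Counter()    # windows in which both types of a pair appear
    for types in buckets.values():
        single.update(types)
        pair.update(combinations(sorted(types), 2))

    return {
        (a, b): count / (single[a] + single[b] - count)
        for (a, b), count in pair.items()
    }
```

Because the alarm types are treated as symbols rather than numeric values, this kind of counting works directly on categorical data; fixed windows are a simplification, and events that straddle a window boundary are not paired.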
This approach helps domain experts learn unfamiliar and evolving behavior patterns in multi-technology, multi-vendor environments. These correlation and grouping models enable automatic grouping of alarms, paving the way for automated detection of network incidents, root cause identification, and self-healing.
With this approach, we can achieve intelligent grouping of alarms and tickets with minimal manual involvement, reduce or completely avoid manual rule development by automatically identifying a large number of groupings that would otherwise be missed, and reduce the total number of incident tickets.
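To make the idea of automatic grouping concrete, here is a minimal sketch that merges alarm types into candidate incident groups whenever their pairwise correlation score passes a threshold. The input dictionary of scores, the 0.5 threshold, and the use of union-find are assumptions chosen for this example; a production system would likely use richer models.

```python
def group_alarm_types(scores, threshold=0.5):
    """scores: {(type_a, type_b): correlation score between 0 and 1}.

    Returns a sorted list of groups, each a sorted list of alarm types
    connected through pairs whose score is at least `threshold`.
    """
    parent = {}

    def find(item):
        # Union-find lookup with path halving.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold:
            parent[root_a] = root_b

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical scores for illustration only.
example_scores = {
    ("link_down", "bgp_down"): 0.9,
    ("bgp_down", "service_degraded"): 0.7,
    ("fan_fail", "link_down"): 0.1,
}
incident_groups = group_alarm_types(example_scores)
```

In this example the three strongly correlated alarm types end up in one group, which could be raised as a single incident ticket instead of three, while the weakly correlated `fan_fail` stays on its own.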