Network Performance Monitoring Is Only Step One
Incident response aims to identify, limit, and mitigate an incident. Whether the incident is a security breach or a hardware failure, formulating and continuously strengthening an incident response strategy has become vital for all businesses in the digital age. Your incident response strategy consists of the processes your organization follows to handle incidents (such as network outages and service-impacting bugs) and to mitigate their impact.
One strategy for identifying incidents related to your network is to implement network performance monitoring (NPM). NPM enables you to measure, monitor, visualize, and report on the quality of your network as your end users experience it. NPM tools allow network admins and analysts to track metrics from components such as switches, servers, and routers. The metrics analyzed include device or system logs, flow data, and packet data.
While NPM tools help you understand the state of your network, NPM tools alone aren’t a sufficient incident response strategy. They can help you identify issues with your network both proactively and in real time, but they won’t help you solve them.
Network performance monitoring may be the frontline of an incident, but it’s of little use without a support structure that allows for effective and timely incident response. This article explores how you can leverage NPM to support the development of an effective incident response strategy.
Limitations of Network Performance Monitoring
At best, NPM systems can provide insight into the status of the different metrics that communicate the health and functionality of your network. While these metrics help identify the presence of an issue, they won’t necessarily help you solve it.
Deciphering what each metric means in the context of business operations, determining the root cause of the issue, and delivering an intervention remain the prerogative of an Incident Commander. Even with these metrics, you still need the critical thinking and decision-making skills of trained team members to take action.
An NPM tool and the variety of data it provides can be a supportive addition to a well-trained incident response team. To realize their true value, NPM tools should always be paired with an incident response system.
Leveraging Network Performance Monitoring
A sound network performance monitoring system can track various metrics that can, in turn, help formulate an effective incident response. Let’s explore these metrics and the value of each.
Bandwidth utilization measures how much of a network’s maximum data transmission rate (its bandwidth) is in use. If utilization approaches 100 percent, applications can have trouble uploading and downloading data. These issues are common and often generate the bulk of tickets for network admins.
An NPM tool can generate alerts based on spikes in utilization. You can then further diagnose the timing and severity to resolve the incident and prevent new ones from arising. The incident response may include:
- Blocking high utilization and non-critical websites
- Implementing access restrictions
- Allocating network limits for each user
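As a rough illustration of the utilization alerting described above, the sketch below derives a utilization percentage from two readings of an interface byte counter and flags a sustained spike. The function names, the 90 percent threshold, and the three-sample window are assumptions made for this example, not settings from any particular NPM product.

```python
def utilization_percent(bytes_start, bytes_end, interval_s, link_bps):
    """Share of link capacity used over the interval, as a percentage.

    Assumes the byte counter did not wrap or reset between the two readings.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    bits_transferred = (bytes_end - bytes_start) * 8
    return 100.0 * bits_transferred / (interval_s * link_bps)


def is_spike(samples_pct, threshold_pct=90.0, min_consecutive=3):
    """True if utilization stays at or above the threshold for
    min_consecutive samples in a row (hypothetical alert rule)."""
    run = 0
    for pct in samples_pct:
        run = run + 1 if pct >= threshold_pct else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring several consecutive high samples is one way to avoid paging the team for a single short burst; the right window depends on how often your tool polls.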
Network uptime is another key metric. NPM tools can alert incident response teams whenever the network is down for longer than a set duration, allowing them to take immediate action. Depending on the suspected root cause, the incident response plan might include checking for malfunctioning routers, switches, and cabling.
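A minimal sketch of the "down for longer than a set duration" rule, assuming the monitoring tool produces a time-ordered list of probe results. The `outage_alerts` name and the `(timestamp, is_up)` input shape are hypothetical.

```python
def outage_alerts(probes, max_down_s):
    """Return the start time of every outage that lasted longer than max_down_s.

    probes: list of (timestamp_s, is_up) tuples sorted by timestamp.
    """
    alerts = []
    down_since = None   # timestamp of the first failed probe in the current outage
    alerted = False     # ensures at most one alert per outage
    for ts, is_up in probes:
        if is_up:
            down_since = None
            alerted = False
            continue
        if down_since is None:
            down_since = ts
        if not alerted and ts - down_since > max_down_s:
            alerts.append(down_since)
            alerted = True
    return alerts
```

For example, with probes every 10 seconds and `max_down_s=15`, a single failed probe is ignored, while three failed probes in a row raise one alert stamped with the time the outage began.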
Packet loss refers to the number of packets that are sent successfully but never arrive at the destination. Packet loss can be measured via synthetic monitoring, where the system creates and sends synthetic data packets. The system then counts the packets that arrive at the destination and calculates the packet loss. The NPM tool can identify the point where packet loss is occurring and alert on it.
Pinpointing the location of packet loss enables incident response teams to zero in on the problem more quickly, allowing for a quicker fix. Resolution may include:
- Upgrading hardware
- Checking and resolving bandwidth bottlenecks
- Implementing software upgrades
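The synthetic-monitoring calculation can be sketched as follows, assuming each synthetic packet carries a sequence number and that loss has already been measured per hop along the path. Both function names, the hop names in the usage note, and the 1 percent threshold are illustrative assumptions.

```python
def packet_loss_percent(sent_seq, received_seq):
    """Percentage of synthetic packets that were sent but never arrived."""
    sent = set(sent_seq)
    if not sent:
        raise ValueError("no probes were sent")
    lost = sent - set(received_seq)
    return 100.0 * len(lost) / len(sent)


def first_lossy_hop(loss_by_hop, threshold_pct=1.0):
    """Return the first hop whose loss meets the threshold, or None.

    loss_by_hop: ordered list of (hop_name, loss_pct) from source to destination.
    """
    for hop, loss_pct in loss_by_hop:
        if loss_pct >= threshold_pct:
            return hop
    return None
```

Given per-hop results such as `[("edge-sw", 0.0), ("core-rtr", 4.5), ("isp-gw", 4.7)]`, the second function points the response team at `core-rtr` as the place where loss first appears.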
Latency is the delay in communication between two endpoints, such as a host and server. NPM tools can measure latency and help identify whether unusually high latency stems from routing issues, long-distance attenuation, storage authorization, or propagation delays. An Incident Commander can assess the root cause and relay it to support engineers to implement an array of fixes, such as:
- Upgrading a router
- Checking internet speed and bandwidth utilization
- Reducing the distance between the system hosting the application and the router
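One simple way to approximate latency between a host and a server is to time a TCP connection, then compare the results against a known-good baseline. The sketch below does that with the Python standard library; the function names, the baseline, and the 1.5x tolerance are assumptions for illustration, and a dedicated NPM tool will typically measure latency in more detail than this.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, timeout_s=2.0):
    """Time a TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s):
        pass  # connection is closed on exit; only the setup time matters
    return (time.perf_counter() - start) * 1000.0


def summarize_latency(samples_ms, baseline_ms, tolerance=1.5):
    """Summarize samples and flag degradation relative to a baseline."""
    median = statistics.median(samples_ms)
    return {
        "median_ms": median,
        "max_ms": max(samples_ms),
        "degraded": median > baseline_ms * tolerance,
    }
```

Using the median rather than the mean keeps one slow sample from triggering an alert, while the maximum is still reported so that outliers remain visible.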
Jitter is the variation in the arrival time of packets caused by network congestion, changes in routing, or inconsistent latency. Jitter is most noticeable during audio or video calls, where the sound or video becomes perceptibly garbled. NPM tools can log instances of jitter. Studying the logs can help the incident response team determine whether the jitter is intermittent or happens at a particular time of day (e.g., peak working hours).
Examining metrics in context with other variables can help the incident response team take corrective actions such as limiting bandwidth use by non-critical applications, installing a jitter buffer, or switching the network provider.
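To make the time-of-day analysis concrete, the sketch below computes jitter as the mean absolute difference between consecutive latency samples and groups the result by hour. This is a deliberately simplified definition chosen for the example (RTP, for instance, uses a smoothed estimator defined in RFC 3550), and the function names and input shape are hypothetical.

```python
from collections import defaultdict


def jitter_ms(latencies_ms):
    """Mean absolute difference between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)


def jitter_by_hour(samples):
    """Group samples by hour and compute jitter for each hour.

    samples: list of (hour_of_day, latency_ms) in arrival order.
    """
    by_hour = defaultdict(list)
    for hour, latency in samples:
        by_hour[hour].append(latency)
    return {hour: jitter_ms(values) for hour, values in sorted(by_hour.items())}
```

If the 9:00 bucket shows far more jitter than the 14:00 bucket day after day, that suggests a load-related cause rather than an intermittent fault.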
Implementing Incident Response
When used in tandem, the above metrics can provide a precise and holistic overview of your network. With this information in hand, you can incorporate NPM into your incident response strategy to identify and resolve incidents, both reactively and proactively.
Establish Your Incident Response Team
The first step is always to establish your incident response team. This team will consist of communication specialists, subject matter specialists, and designated leaders. When monitoring your network, ensure that your team contains individuals and leads directly involved in your network’s security, such as cybersecurity and supply chain leads.
Determine Points of Weakness
Once you have established your incident response team, invite all stakeholders, network specialists, and the incident response team to list every digital system and scenario that could act as a single point of failure, bottleneck, or general vulnerability within your network. Working collectively to determine potential network vulnerabilities enables you to proactively investigate concerns before they grow into issues.
Select Your NPM Tools
Once you’ve determined the points of weakness in your network, you can select NPM tools to help you monitor different aspects of—and vulnerabilities in—your network. NPM tools can detect threats by employing a variety of techniques:
- Device monitoring: Monitoring device and network uptimes can help measure system health. Variation in uptimes gives your response team visibility into issues it can act upon. Additionally, device monitoring helps detect faulty hardware, enabling proactive replacements or repairs before the impact reaches end users. Network admins can also zero in on any device chokepoints and optimize throughput as needed.
- Application workflow analysis: Ensuring that the application workflows are working as expected by building validation checks between endpoints can help measure system health. If a workflow is not operating as expected, it could potentially mean that someone has accessed the network and tampered with the application configuration. However, network breaches are not the only possibility. The issue could also be indicative of sub-optimal network design or data errors.
Develop a Mitigation Strategy
Once you’ve installed your NPM tools and have determined how to monitor their metrics, work with the stakeholders to map the steps required to mitigate each threat. Be as detailed and action-oriented as possible.
Assign owners and service level agreements (SLAs) for each, draw up the escalation matrix if the SLA isn’t met, set up thresholds, and automate alerts. By using NPM metrics, you can create an incident response strategy that relies on your network’s performance indicators so that you can more easily identify and resolve issues impacting your network.
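One way to capture owners, SLAs, thresholds, and an escalation matrix in a form that alerts can be automated against is a small rule record per metric. The sketch below is an assumed structure for illustration; the field names, team names, and the rule that each missed SLA period moves one step up the chain are all hypothetical choices, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float     # alert when the metric value reaches this level
    owner: str           # first responder for this metric
    sla_minutes: int     # time the owner has to resolve the incident
    escalation: tuple    # ordered contacts to notify once the SLA is missed


def who_to_notify(rule, value, minutes_open):
    """Return the contact responsible right now, or None if no alert applies."""
    if value < rule.threshold:
        return None
    if minutes_open <= rule.sla_minutes or not rule.escalation:
        return rule.owner
    # Each additional missed SLA period moves one step up the escalation chain,
    # stopping at the last contact.
    step = (minutes_open - 1) // rule.sla_minutes - 1
    return rule.escalation[min(step, len(rule.escalation) - 1)]
```

With a 30-minute SLA and an escalation chain of a network lead followed by the Incident Commander, an incident open for 45 minutes goes to the lead and one open for 90 minutes goes to the Incident Commander.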
Say you’re monitoring your NPM metrics, and suddenly you detect high bandwidth utilization and a surge in packet loss. You suspect a breach. Luckily, your NPM tools (and the metrics they provide) pinpoint that the breach is coming from an unrecognized system accessing one of the routers. With this information, you can immediately disable the router and alert on-site security. By monitoring NPM metrics, you can react faster and minimize damage.
Then, you can use your NPM tools to analyze the packets and identify which network components, including highly restricted data, were compromised. Because you can determine where the attack happened, you can better understand the vulnerability and prevent recurrences.
Benefits of a Robust System
Using a robust, NPM-driven incident response system has several benefits. This combination enables you to quickly identify and resolve network performance issues and outages while also helping you prevent similar problems from reoccurring.
In addition to these larger benefits, using an NPM-driven incident response strategy improves:
- Aggregation: Incident management systems can obtain metrics from NPM tools and aggregate the inputs in context with other data points, such as user tickets and in-person reports.
- Channeling: Multiple streams of variable data can flow to the right personnel for analysis, and issues can be flagged to the right teams.
- User satisfaction: Close-knit integration of NPM and incident management systems (IMS) helps reduce ticket resolution time significantly, increasing user satisfaction and trust in the IT team.
- Employee satisfaction: A better ticket resolution rate leads to enhanced employee satisfaction and work-life balance.
- Automation: The bidirectional data flow between the IMS and NPM can help automate alerts and build prescriptive and predictive analytics.
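The aggregation and channeling ideas above can be sketched as a small routing step that groups NPM alerts and user tickets by device and sends each group to a team. The `ROUTES` table, the team names, and the event fields are assumptions invented for this example.

```python
from collections import defaultdict

# Hypothetical mapping from event kind to the team that should analyze it.
ROUTES = {
    "bandwidth": "network-ops",
    "packet_loss": "network-ops",
    "unknown_device": "security",
}


def aggregate_and_route(events, default_team="service-desk"):
    """Group events per team and device, keeping track of each event's source.

    events: dicts with 'source' ('npm' or 'ticket'), 'device', and 'kind'.
    """
    queues = defaultdict(lambda: defaultdict(list))
    for event in events:
        team = ROUTES.get(event["kind"], default_team)
        queues[team][event["device"]].append(event["source"])
    return {team: dict(devices) for team, devices in queues.items()}
```

Seeing an NPM alert and a user ticket land on the same device in the same queue gives the responding team corroborating evidence without manual cross-referencing.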
NPM can help identify vulnerabilities, but it can’t resolve the problems independently. You must augment it with a response plan to glean real, tangible security benefits. NPM can provide the analytics, but the real value comes from an integrated response system, which further processes the data points fed by the NPM tool.
An effective incident response strategy that uses the insights NPM provides will improve the overall efficacy of your development and DevOps efforts. It also helps you keep your system secure and restore service if there are hiccups, interruptions, or outages. NPM is just the start of an organization’s response system. It’s not the whole solution.
xMatters automates operations workflows and ensures that your applications work around the clock. Learn more about automation with xMatters and request a demo to see how you can improve your incident response strategy.