After the outage

Matt Tuson at LogicMonitor explains the importance of hybrid observability in IT recovery

Today’s business IT environments are becoming increasingly complex, with the integration of multi-cloud environments and on-premises systems into organisational infrastructure. However, this influx of systems increases the risk of error, especially system outages.

Outages present a significant threat to operational continuity, but the impacts extend far beyond financial losses. In fact, business leaders can expect the fall-out from outages to include supply chain disruptions, production losses, and significant compliance vulnerabilities. Recent high-profile technology failures in the UK, Spain and elsewhere in Europe, including Finland and, before that, London’s Heathrow Airport, demonstrate just how devastating these events can be.

When Heathrow experienced a significant blackout in March 2025, the impact was felt globally. Thousands of flights were cancelled, with many others rerouted, causing customer uproar. Even more severe was the April 2025 power grid failure across Spain and Portugal, which resulted in chaos: shops, travel networks and offices were forced to close, and 55 million people found their homes without power for up to 23 hours. The economic damage caused by this event is still being calculated.

These may seem extreme examples, but consider also the recent host of outages across major UK banks. An inquiry into recent IT failures at UK banks and building societies revealed the sector has experienced at least 803 hours of downtime over the past two years.

On top of this, banks have had to pay customers millions of pounds in compensation for the disruption and inconvenience the outages have caused. This shows that outages damage businesses both financially and reputationally, leaving them to offer payouts in order to retain customers.

The recovery dilemma

Not every outage is avoidable. Some have no clear cause: I was in Spain at the time of the outage and, at the time of writing, there’s no confirmation of what led to the power failure. What we do know is that the speed at which an organisation is able to return to business directly correlates with its ability to minimise damage.

However, IT leaders face a very real dilemma: push hard and restore systems potentially prematurely, thereby risking even more destructive secondary failures, or extend downtime to ensure thorough recovery while watching financial losses mount?

An infamous example of this comes from July 2024, when CrowdStrike released a routine software update that contained an unexpected bug, crashing 8.5 million computers worldwide. The outage affected hundreds of businesses and led to huge financial losses.

That was bad enough, but the situation was compounded when, after systems appeared to have recovered, they failed again, causing even more disruption. Given how widespread the disruption was globally, legal ramifications followed. For example, Delta Air Lines sued CrowdStrike for £388m after the outage resulted in 7,000 cancelled flights.

Hybrid observability: the foundation of recovery

To support businesses in confidently navigating the recovery process, IT teams need comprehensive oversight of their entire tech stack. Today’s hybrid and multi-cloud environments mean traditional monitoring is becoming obsolete. Enter hybrid observability, a category of software that has emerged as essential for effective outage recovery.

Unlike traditional monitoring tools, hybrid observability provides a holistic, single-pane-of-glass view across all infrastructure, applications and network components, meaning system recovery can be validated before normal service is resumed. The technology can also detect hidden interdependencies that could trigger cascading failures, and identify performance bottlenecks.
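
To make the idea of validating recovery concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular observability product works: the service names, the dependency map and the health results are all hypothetical. It declares a service safe to resume only when the service itself and everything it depends on, directly or indirectly, is reporting healthy.

```python
from typing import Dict, List, Set


def unhealthy_dependencies(
    service: str,
    depends_on: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> Set[str]:
    """Return every unhealthy service reachable from `service`, itself included.

    A service with no recorded health result is treated as unhealthy,
    so an unknown state never counts as a successful recovery.
    """
    seen: Set[str] = set()
    failing: Set[str] = set()
    stack = [service]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(current)
        if not healthy.get(current, False):
            failing.add(current)
        stack.extend(depends_on.get(current, []))
    return failing


def safe_to_resume(
    service: str,
    depends_on: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> bool:
    """True only if the service and all its transitive dependencies are healthy."""
    return not unhealthy_dependencies(service, depends_on, healthy)


# Hypothetical topology: checkout looks fine on its own, but the
# database it relies on (via payments and inventory) is still down.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}
HEALTH = {"checkout": True, "payments": True, "inventory": True, "database": False}

if __name__ == "__main__":
    print(unhealthy_dependencies("checkout", DEPENDS_ON, HEALTH))
    print(safe_to_resume("checkout", DEPENDS_ON, HEALTH))
```

The point of walking the whole dependency chain is that a service which passes its own check can still fail the moment traffic returns, because something further down the stack has not recovered.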

Predictive intelligence: beyond reactive recovery

When observability is driven by AI, its power is supercharged by predictive capabilities, such as identifying patterns and trends in data. Through this, businesses are able to spot and quickly correct secondary faults across their systems, rather than acting after the fact. This is key for recovery and means that organisations know exactly when they are able to announce a return to business as usual.

For example, an AI-powered observability platform might detect patterns suggesting that a database will experience connection problems under full production load, even when a basic connectivity test has succeeded.
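
As a deliberately simple stand-in for that kind of prediction, the Python sketch below fits a straight line to connection counts observed while traffic is being ramped back up, then projects the count at full load. Real platforms use far richer models; the sample figures, the pool limit and the function names here are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_connections(
    load_fractions: Sequence[float],
    connections: Sequence[float],
    target_load: float = 1.0,
) -> float:
    """Extrapolate the connection count expected at `target_load`."""
    slope, intercept = fit_line(load_fractions, connections)
    return slope * target_load + intercept


# Hypothetical readings taken at 10% to 40% of normal production traffic.
SAMPLE_LOAD = [0.1, 0.2, 0.3, 0.4]
SAMPLE_CONNECTIONS = [30, 55, 80, 105]
POOL_LIMIT = 200  # assumed maximum connections the database will accept

if __name__ == "__main__":
    projected = projected_connections(SAMPLE_LOAD, SAMPLE_CONNECTIONS)
    if projected > POOL_LIMIT:
        print(f"Warning: about {projected:.0f} connections projected at full load")
```

In this made-up case every reading is comfortably under the limit, so a basic connectivity test would pass, yet the trend points to the pool being exhausted once full traffic returns.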

When we discuss agentic AI with respect to observability, we begin to look at the next stage of outage recovery management. During recovery operations, we know that IT teams face immense pressure, often working overtime to restore critical systems. As well as ensuring the IT infrastructure is robust enough for operations to return to normal, teams need to focus on strategic initiatives, such as communicating with employees and maintaining business continuity. Under these conditions, human error can easily creep in.

Time is also of the essence when it comes to outage recovery. If the root cause of the incident isn’t identified quickly enough, the fault will not be quashed in time, and the likelihood of further damage will escalate. Agentic AI’s analysis capabilities enable businesses to rapidly identify what caused the outage in the first place, meaning IT teams can resolve the issue and monitor the affected systems before it spirals.

Agentic AI also supports teams by understanding and autonomously handling technical tasks, such as simplifying alerts into summaries and retrieving step-by-step troubleshooting guidance. While analysing business data and adapting to new situations, it can perform tasks with minimal human intervention, thereby allowing IT workers the time to concentrate on work that necessitates their human expertise. More than that, it clusters related alerts into a single, coherent incident.
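
Alert clustering can be sketched in a few lines of Python. The version below uses a plain time-window rule, grouping alerts that arrive within a set gap of one another. That rule is a simplifying assumption, and so are the `Alert` fields, the two-minute gap and the sample data; production systems typically also weigh topology and the content of the alerts.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    resource: str
    message: str


def cluster_alerts(
    alerts: Iterable[Alert], max_gap_seconds: float = 120.0
) -> List[List[Alert]]:
    """Group alerts into incidents.

    An alert joins the current incident if it arrives within
    `max_gap_seconds` of that incident's most recent alert;
    otherwise it starts a new incident.
    """
    incidents: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= max_gap_seconds:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


def summarise(incident: List[Alert]) -> str:
    """One-line summary of an incident for a human reader."""
    resources = sorted({a.resource for a in incident})
    return f"{len(incident)} alerts across {', '.join(resources)}"


# Hypothetical alerts: three arrive close together, one much later.
SAMPLE_ALERTS = [
    Alert(100.0, "web-1", "HTTP 500 rate high"),
    Alert(0.0, "db-1", "connection pool exhausted"),
    Alert(30.0, "web-1", "upstream timeout"),
    Alert(1000.0, "cache-1", "memory usage high"),
]

if __name__ == "__main__":
    for incident in cluster_alerts(SAMPLE_ALERTS):
        print(summarise(incident))
```

Instead of four separate notifications, the team sees two incidents, the first of which ties the web errors to the database problem that preceded them.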

Teams can engage with the AI to understand root causes, gain impact analysis, and gather recommended next steps. Naturally, this further improves mean time to resolution (MTTR).

Building resilience is essential

Businesses can’t always predict when an outage will occur. While observability can help mitigate risks in most situations, some factors remain beyond an organisation’s control, and therefore, forward planning is critical.

By investing in solutions like agentic AI and hybrid observability, businesses can provide their IT teams with the support and confidence they need during their most challenging operational moments.

Matt Tuson is General Manager EMEA of LogicMonitor

Main image courtesy of iStockPhoto.com and Evgeny Gromov