
The future of operations in the age of AI

João Freitas at PagerDuty argues that maintaining AI agents and LLMs requires a new operations playbook

AI adoption is accelerating, bringing not only new ways of working but also new failure modes. Even as these new challenges arise, reliability and resilience remain major operational concerns, just as they do across the rest of digital operations, including DevOps, ITOps and AIOps.


As AI-driven incident management emerges as a new category, incidents such as the global IT outage of July 2024 continue to catch companies off guard. Even in such novel incidents, well-designed automation and AI capabilities can contain the scope of failure and enable a faster, more efficient response. Greater context and less noise for responders is just the tip of the iceberg of how an AI-driven incident management framework can change the game. However, making AI an intrinsic part of the operations playbook is a journey that many companies are only just starting.


Reliability continues to be a pillar of operations, and a key part of the customer service experience. Traditional incident management frameworks were not designed to handle the non-deterministic behaviour of LLMs and AI agents. So, in a world where everyone is building and using AI agents, the way we respond to incidents also needs to evolve. For example, how do we detect and respond to hallucinations, or to an agentic workflow that is stuck in a loop?
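
To make the loop question concrete, here is a minimal, illustrative sketch of one way to flag an agent that keeps repeating the same step. The class name, window size and repeat threshold are assumptions for illustration only, not a reference to any particular product's capability.

```python
from collections import deque

class LoopGuard:
    """Flags an agent that keeps repeating the same step.

    Illustrative sketch only: the window size and repeat threshold are
    arbitrary assumptions, not values from any specific platform.
    """

    def __init__(self, window: int = 10, max_repeats: int = 3):
        self.recent = deque(maxlen=window)   # last N (tool, args) steps
        self.max_repeats = max_repeats

    def record(self, tool: str, args: str) -> bool:
        """Record a step; return True if the agent looks stuck in a loop."""
        step = (tool, args)
        self.recent.append(step)
        # If the same step dominates the recent window, treat it as a loop.
        return self.recent.count(step) >= self.max_repeats


# Example: a third identical knowledge-base search in a row trips the guard.
guard = LoopGuard()
for _ in range(3):
    looping = guard.record("search_kb", '{"query": "refund policy"}')
print(looping)  # True
```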


The limits of digital operations in the age of AI

Traditional DevOps and MLOps solutions were architected and built for deterministic systems, before the widespread adoption of generative AI models.


LLMs and autonomous agents introduce variability, context-dependence and unpredictability into workflows, while also increasing risk. Imagine, for example, a customer support agent that performs inconsistently depending on prompt nuance or unseen data. When a customer reports a poor experience or logs a fault, engineers need to be able to determine exactly what led to the issue.


Alongside this, the world of digital operations has been changing steadily. Operational work related to LLMs and agents is creating new categories and disrupting existing disciplines such as DevOps and ITOps.


Despite some overlap, these remain distinct areas of operation, each building on the others as the enterprise grows in complexity and capability. They differ in key ways across observability, deployment, version control and real-time feedback loops.


Organisations that rely on standard monitoring tools and fail to adapt their business processes to the realities of the new IT environment may be unable to handle these new failure modes, exposing themselves to risks such as data exfiltration or performance drift. For example, an AI-powered HR assistant may suggest outdated policy documents because of flaws in its memory management, or it could begin to misinterpret jurisdictional rules after a model update.


The AI use cases are there, but the path to significant ROI, or to widespread adoption that results in a changed and newly hyper-productive organisation, is not always smooth. Not without the backroom expertise of a strong operations team working to clear policies, and a resilience platform that allows for rapid remediation and the ability to learn from failure.


Building a new resilience playbook

As always, best practices are emerging as the realities of operations, business needs and user behaviour converge. High on the agenda right now are shadow deployments (creating business risk), live feedback cycles (straining resources) and continuous manual fine-tuning (causing alert and cognitive fatigue).


Teams are experimenting with tools like Retrieval Augmented Generation (RAG) pipelines that combine retrieval-based systems with LLMs for greater control, and real-time telemetry on agent reasoning steps to flag anomalies before they trigger user-facing errors.
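
As a sketch of what that telemetry might look like, the snippet below wires a minimal RAG step to an event log and flags an anomaly when retrieval comes back empty, before the model produces an unsupported answer. The `retriever` and `llm` callables are placeholders for whatever index and model a team actually runs, and the empty-retrieval rule is just one example of a check worth alerting on.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Telemetry:
    """Collects one event per pipeline step so anomalies can be flagged early."""
    events: list = field(default_factory=list)

    def emit(self, step: str, **data):
        self.events.append({"step": step, "ts": time.time(), **data})


def answer(question: str, retriever, llm, telemetry: Telemetry) -> str:
    """Minimal RAG sketch: retrieve supporting passages, then generate.

    `retriever` and `llm` stand in for the team's own retrieval index and
    model; the anomaly rule below is one illustrative check, not a standard.
    """
    passages = retriever(question)
    telemetry.emit("retrieve", question=question, passages_found=len(passages))

    if not passages:
        # Flag the anomaly instead of letting the model answer unsupported.
        telemetry.emit("anomaly", reason="no supporting passages retrieved")
        return "I don't have enough information to answer that."

    context = "\n".join(passages)
    reply = llm(f"Answer using only this context:\n{context}\n\nQ: {question}")
    telemetry.emit("generate", answer_chars=len(reply))
    return reply
```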


As much as AI agents promise fast and efficient business processes, it's prudent to incorporate human-in-the-loop systems and rely on tested response automation, rolling agents out slowly and carefully so that any unpredictability can be observed, monitored and triaged at a lower risk profile. A human operator is essential when AI operates in critical workflows in regulated sectors such as healthcare or finance, where output must be vetted for accuracy and compliance before action is taken.
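
A human-in-the-loop gate can be as simple as routing high-risk actions to a reviewer before they execute. The sketch below assumes hypothetical `run` and `request_approval` callables supplied by the team, and an arbitrary risk threshold; it illustrates the pattern rather than prescribing an implementation.

```python
def execute_with_review(action: dict, risk_score: float, run, request_approval):
    """Route risky agent actions to a human reviewer before they run.

    `run` executes the action; `request_approval` blocks until a reviewer
    decides. Both are placeholders for a team's own tooling, and the 0.5
    threshold is purely illustrative.
    """
    if risk_score >= 0.5:                        # e.g. payouts, record changes
        if not request_approval(action):
            return {"status": "rejected_by_reviewer", "action": action}
    return run(action)
```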


Operations leaders should ensure that agent misbehaviour or hallucinations can be detected and immediately stopped by real-time remediation workflows, and that automated routines authorised to kick in do not themselves cascade. Slowing down AI processes when markers of abnormality are spotted is critical to ensuring incidents don't compound, grow and deepen.
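
One common way to keep authorised automation from cascading is a circuit breaker that halts agent actions once anomalies start piling up. The sketch below is illustrative only; the thresholds and cooldown are assumptions a team would replace with its own error budgets.

```python
import time

class CircuitBreaker:
    """Pauses automated agent actions when anomalies accumulate too quickly.

    The anomaly limit, window and cooldown are illustrative assumptions,
    not recommended values.
    """

    def __init__(self, max_anomalies: int = 5, window_s: int = 60, cooldown_s: int = 300):
        self.max_anomalies = max_anomalies
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.anomalies: list[float] = []
        self.open_until = 0.0   # while "open", automation is halted

    def record_anomaly(self):
        now = time.time()
        # Keep only anomalies inside the sliding window.
        self.anomalies = [t for t in self.anomalies if now - t < self.window_s]
        self.anomalies.append(now)
        if len(self.anomalies) >= self.max_anomalies:
            self.open_until = now + self.cooldown_s   # stop the cascade

    def allow_action(self) -> bool:
        """Gate every automated routine through this check before it fires."""
        return time.time() >= self.open_until
```

In this pattern, every automated routine calls `allow_action()` before firing, and anomaly detectors call `record_anomaly()` as issues are spotted, so a burst of misbehaviour pauses further automation rather than amplifying it.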


Underlying everything, operations teams should ensure that the systems used to observe are not dependent on the systems used to operate. A 'church-and-state' model that maintains independence between observability and incident management enables responders to act with confidence, even as systems misbehave.


Increasing reliability in a non-deterministic world

The deep industry focus on AI innovation and rollout has led to AIOps becoming an essential discipline in its own right. As the complexity and interconnectedness of IT systems grow, spanning third-party dependencies, cloud services and unpredictable AI services, there must be a corresponding focus on engineering for reliability. Most organisations now rely on cloud-hosted models or APIs from vendors they don't control. A change or failure in these upstream systems can cause downstream chaos, with operations teams left scrambling to diagnose issues outside their observability scope.


Forward-looking businesses must reframe reliability as an active and adaptive process: a business virtue designed not only to protect the business and support profitability, but also to improve customer service and retention. They can act by tracking metrics such as time-to-detect for AI anomalies, the rate of escalated AI incidents and the number of unauthorised automated actions to measure operational exposure and maturity.
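
As a rough illustration of those measures, the snippet below computes mean time-to-detect, escalation rate and unauthorised-action counts from a couple of hypothetical incident records; the field names and data are invented purely for the example.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; field names are assumptions for illustration.
incidents = [
    {"started": datetime(2025, 3, 1, 9, 0), "detected": datetime(2025, 3, 1, 9, 12),
     "escalated": True,  "unauthorised_actions": 0},
    {"started": datetime(2025, 3, 4, 14, 0), "detected": datetime(2025, 3, 4, 14, 3),
     "escalated": False, "unauthorised_actions": 2},
]

# Mean time-to-detect for AI anomalies.
ttd = sum(((i["detected"] - i["started"]) for i in incidents), timedelta()) / len(incidents)

# Share of AI incidents that had to be escalated to humans.
escalation_rate = sum(i["escalated"] for i in incidents) / len(incidents)

# Automated actions that fired without authorisation.
unauthorised = sum(i["unauthorised_actions"] for i in incidents)

print(f"mean time-to-detect: {ttd}")
print(f"escalation rate: {escalation_rate:.0%}")
print(f"unauthorised automated actions: {unauthorised}")
```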


Enabling AI resilience through modern incident management must become a business priority for as long as rollouts of this exciting but non-traditional technology add to business risk just as much as they offer benefits.


João Freitas is GM & VP of Engineering for AI and Automation at PagerDuty


Main image courtesy of iStockPhoto.com and Nirunya Juntoomma
