Jeremy Swinfen Green considers how business leaders should manage the risks of AI systems developing unintended behaviour
Implementing artificial intelligence systems in any business comes with considerable benefits, including increased efficiency, enhanced insight generation, and creativity. But it also comes with risks.
Many of these risks are associated with regulatory compliance. They include problems with privacy, often caused by poor data management, and equality issues, often stemming from the use of flawed data to train algorithms. There are other risks as well, many of which are associated with the negative ways people within an organisation react to the use of AI-powered tools and processes.
These risks are often the ones that are emphasised when people talk about AI governance. And they are, of course, significant. But there is another set of risks that business leaders must also be aware of. These risks are not caused by poor data or inadequate business processes. Instead, they are caused by the design of the AI system. This means that once they arise, they may be very difficult to manage.
AI system risks
Design flaws can result in AI systems displaying unintended, unpredictable and highly damaging behaviour. As such, these flaws should be central to any risk management activity around AI.
While there are many such system risks, there are a small number that seem to make people particularly worried. AI systems that refuse to be shut down. Automated machinery that deliberately harms people. IT networks that stage a takeover of critical infrastructure. AI systems that conspire together in unknown languages to do who knows what...
These scenarios are the stuff of nightmares. So what can realistically be done about them?
A refusal to shut down
The paperclip problem is a thought experiment proposed by the philosopher Nick Bostrom more than two decades ago. In it, an AI system is designed to manufacture paperclips. It pursues that single goal relentlessly, gradually consuming more and more of the Earth’s resources (crops, houses, people, mountains) and resisting all human attempts to stop it, because any interruption conflicts with its only purpose: making paperclips.
A paperclip apocalypse caused by runaway AI is very unlikely, and the idea is based on an oversimplified idea of how AI works. AI (at least as we know it today) has no intrinsic will. If an AI system refuses a shutdown command, it would be due to a programming error or a poorly designed goal system, not because it "wants" to fulfil its purpose.
However, that doesn’t mean out-of-control AI systems that are hard to switch off could never exist. Systems should therefore be designed with defences against this kind of rigidity built in. These include shutdown protocols, such as hardware “kill switches” or software commands that bypass normal operation. Goal systems should also be set up to acknowledge explicitly that shutdown is acceptable, or even desirable, in some conditions. And systems should be designed to accept updates to their goals without resisting them.
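To make the idea of “shutdown is acceptable” slightly more concrete, here is a minimal sketch in Python of an agent loop that checks an operator-controlled stop signal before every goal-directed step and treats stopping as a legitimate outcome rather than something to optimise away. The names and structure are illustrative assumptions, not a reference to any particular framework.

```python
import threading

# Illustrative "kill switch": in a real deployment this flag would live outside
# the agent's own process and be controllable only by human operators.
shutdown_requested = threading.Event()

def operator_requests_shutdown():
    """Called by a human operator or a monitoring system, never by the agent itself."""
    shutdown_requested.set()

def run_agent(plan_next_action, execute_action):
    """Main loop: the shutdown check comes before any goal-directed step,
    and stopping is treated as an acceptable outcome, not a failure to route around."""
    while not shutdown_requested.is_set():
        action = plan_next_action()
        execute_action(action)
    print("Shutdown acknowledged; halting cleanly.")
```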
Malign or damaging outputs
A brilliant IT technician I know was asked whether he could solve a data privacy problem. He answered “Yes”. A couple of days passed, and the problem still hadn’t been resolved. “Why haven’t you sorted this?” the CEO asked crossly. “Because you never asked me to,” the technician replied. “You just asked me if I could sort it out.”
AI systems will typically optimise their actions exactly in line with what they are told to achieve, rather than what their human creators may have intended. Poorly specified goals can lead to unintended behaviour: if the wrong metrics are used, then the wrong outputs will be delivered.
If you ask an AI system to maximise click-through rates, it may do so by creating sensationalist or “clickbait” content, and suddenly your plumbing company website is encouraging class warfare. A “click” isn’t the only thing you wanted. You also wanted the content to increase interest in your plumbing offer – but you forgot to ask the AI to deliver that.
This problem is known as “reward hacking”: poorly instructed AI models find shortcuts to achieve objectives without actually fulfilling the spirit of the task (they reach the right outcome but in the wrong way). It is similar to “specification gaming”, where AI systems “cheat” by acting in ways that meet targets but produce undesirable real-world outcomes (they achieve an outcome that is technically correct but not what the programmers intended).
One key to avoiding these problems is to be clear about what you want to achieve (increased interest in my plumbing services), to state what you don’t want, and to avoid expressing the goal in terms of intermediate steps (clicks on links on my home page).
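As a rough illustration of the difference between a proxy metric and the outcome you actually care about, the sketch below contrasts a naive scoring rule based only on clicks with one that also accounts for relevance and brand safety. The field names and weighting are invented for this example.

```python
# Illustrative only: contrasting a proxy metric with a fuller objective.
# Field names and weights are invented for this example.

def naive_reward(content):
    # Optimising for clicks alone invites clickbait ("reward hacking").
    return content["clicks"]

def better_reward(content):
    # Score the outcome you actually want: clicks only count for content that
    # is relevant to the plumbing offer, and off-brand material scores nothing.
    if not content["passes_brand_guidelines"]:
        return 0.0
    return content["clicks"] * content["relevance_to_plumbing"]

example = {"clicks": 500, "relevance_to_plumbing": 0.1, "passes_brand_guidelines": True}
print(naive_reward(example))   # 500 - looks like a success
print(better_reward(example))  # 50.0 - reveals the content is mostly off-topic
```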
Uncontrolled autonomy
The scope of an AI’s permissions within business systems can pose serious risks. An AI system granted broad access to databases, email systems and APIs might increase its permission levels and start acting beyond its intended scope if it is “rewarded” for doing so. This is especially true for generative AI systems that can create executable code.
Problems can also be caused by a failure to limit permissions (as opposed to limiting the ability to change permission levels). Autonomous agents designed to interact with other tools (e.g., customer service systems and procurement platforms) may start to create new workflows without being instructed to do so by humans; this could have a real-world impact, such as ordering goods, altering pricing, or sending communications.
To prevent these situations, systems must be designed with guardrails in place, such as limits on the access to, and control of, critical systems and tools that an AI is granted. For example, an AI system might be denied access to the internet or to financial systems except for strictly specified purposes.
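A very simple form of such a guardrail is an explicit allow-list that sits between the AI system and the tools it can invoke, so that anything outside the agreed scope is refused rather than executed. The sketch below assumes a basic tool-calling agent; the tool names are placeholders, not a particular vendor’s API.

```python
# Minimal sketch of a permission guardrail for a tool-calling agent.
# Tool names are placeholders; a real system would also log refusals for review.

ALLOWED_TOOLS = {
    "read_customer_record",  # read-only access is permitted
    "draft_email",           # drafting is allowed; sending still needs human approval
}

def call_tool(tool_name, perform, **kwargs):
    """Execute only tools on the allow-list; refuse everything else by default."""
    if tool_name not in ALLOWED_TOOLS:
        print(f"Refused: '{tool_name}' is outside the agent's permitted scope.")
        return None
    return perform(**kwargs)

# Example: an attempt to order goods is blocked because it was never permitted.
call_tool("place_purchase_order", perform=lambda **kw: None, supplier="Acme", amount=5000)
```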
Working in secret
Two AI systems may be designed to collaborate. When they do, however, they may drift from communicating in human-understandable language to using self-generated languages. This phenomenon was observed in 2017, when researchers at Facebook discovered two AI programs chatting to each other in a strange and unintelligible language.
It may well be more efficient for two AIs to avoid the constraints and unnecessary (to an AI) elements of human language. Indeed, a communication form designed to increase the speed with which two AI systems can communicate has been proposed: “Gibberlink mode”.
The difficulty arises when two systems communicate in ways that humans cannot interpret, making it impossible for humans to understand what is being communicated and the potential outcomes of the communication, especially in the medium and long term. To prevent this problem, systems should be designed with AI decision-making that is explainable, allowing humans to understand why certain decisions are being made.
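One practical reading of that principle is to require every message passed between systems to carry a plain-language explanation that is written to an audit log humans can review, and to reject messages that lack one. The message format below is a made-up illustration, not an established standard.

```python
import json
from datetime import datetime, timezone

# Illustrative sketch: force inter-agent messages into a structured, auditable format.
# The schema is invented for this example.

def send_message(sender, recipient, payload, rationale, audit_log):
    """Refuse any agent-to-agent message that does not explain itself in plain language."""
    if not rationale or len(rationale.split()) < 5:
        raise ValueError("Message rejected: a human-readable rationale is required.")
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "from": sender,
        "to": recipient,
        "payload": payload,
        "rationale": rationale,
    }
    audit_log.append(json.dumps(record))  # kept so humans can review the exchange later
    return record

log = []
send_message("pricing-agent", "stock-agent",
             {"sku": "A123", "proposed_price": 19.99},
             "Proposing a small price rise because stock of item A123 is running low.",
             log)
```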
Keeping control of AI systems
Considerable research is being conducted into the problems outlined above (and other similar issues). Unfortunately, as AI grows in power, these areas will only become more problematic. Most business leaders will be unable to get their hands dirty with the design of AI systems. However, they can at least ensure that they are aware of what might potentially go wrong and thus be able to interrogate the design and development teams about the defences being put in place.
Business leaders can also ensure that general safety principles are written into governance and design processes, including: reliable shutdown and override mechanisms; goals specified in terms of the outcomes that are actually wanted, rather than proxy metrics; tightly limited permissions for autonomous agents; and decision-making and inter-system communication that remain explainable and auditable by humans.
Today’s AI systems are highly unlikely to deliver the type of catastrophic outputs seen in some science fiction films and novels. They are, after all, little more than prediction engines and have no malign will of their own. However, they are powerful, just as cars and factory machines are powerful. They should therefore be built, implemented and operated with respect and caution.
Business leaders must not be blind to the risks these systems present. But neither should they turn away from the enormous potential benefits of AI and automation because they are afraid of those risks or have failed to understand them. AI systems are just machines, and a well-managed machine will rarely cause unexpected problems for its owner.