Everyone can build an agent, says Hristo Borisov at Payhawk. But very few can trust one

When OpenAI launched AgentKit in October, it quickly became a reference point for the next wave of AI development. Reddit filled with screenshots of drag-and-drop workflows. X/Twitter buzzed with claims of a “new era of building.” It felt like the moment agents went mainstream.
But enterprise enthusiasm lagged behind developer excitement. While engineers shared demos, business leaders asked harder questions: What happens when these agents touch financial systems, approve spend, or reconcile accounts? What happens when they fail with real money on the line?
AgentKit became both a milestone and a mirror. It showed how easy agents have become to build—and how unclear reliability, governance, and accountability still are.
The new divide in AI: speed vs. trust
AgentKit deserves credit for lowering the barrier to entry. It gives developers a shared language for agent design: visual wiring, faster prototyping, and a common interface between models, data, and tools. For teams shipping early proofs of concept, that’s meaningful progress.
But those benefits mostly apply before the first real deployment.
AgentKit lives inside a closed stack and depends on OpenAI’s own models. Its core design is linear: one step waits for the previous one. That makes testing and debugging straightforward, but real-world workflows don’t stay linear. They branch, overlap, retry, and fail out of order.
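To make the failure mode concrete, here is a minimal sketch in Python. The step names are hypothetical placeholders, not AgentKit’s actual API; the point is what a linear chain does when a step fails after an earlier step has already committed a side effect.

    # Hypothetical three-step finance workflow; illustrative names only.
    def extract_invoice(doc):
        return {"vendor": "Acme", "amount": 1200}

    def post_invoice(invoice):
        print("posted to ERP")            # side effect: financial state changes

    def notify_approver(invoice):
        raise TimeoutError("approver service unavailable")

    def run_linear(doc):
        invoice = extract_invoice(doc)
        post_invoice(invoice)             # step 2 commits...
        notify_approver(invoice)          # ...then step 3 fails

    try:
        run_linear("invoice.pdf")
    except TimeoutError:
        pass  # a blind retry from the top would post the invoice twice

A resumable workflow would checkpoint after each step and retry only the one that failed; a strictly linear chain has nowhere natural to keep that state.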
You can already see the pattern in developer forums: week one is exhilaration, week two is frustration. As one engineer put it, “You can build an agent in a day, but you can’t keep it running for a month.”
The hard part isn’t building the agent. It’s keeping it stable, observable, and explainable once it’s live. AgentKit raises the floor for what anyone can build. The ceiling—agents you can rely on—still belongs to teams that design for trust from day one.
The speed illusion
Early adopters have demonstrated just how fast agents can be assembled. Invoice-coding bots built in hours. Procurement workflows stitched together in a sprint. Demos circulate widely, reinforcing the idea that iteration speed itself is now a competitive moat.
But speed is not resilience.
A “speed moat” assumes perfect conditions: no model outages, no rate limits, no API failures, no policy changes. Finance doesn’t work that way. It rewards consistency, traceability, and recovery when things go wrong. A single failed call or an out-of-order approval can erase weeks of rapid iteration.
Many early agents are built inside closed orchestration ecosystems. That simplifies the first version but concentrates risk. When every workflow depends on the same provider and routing logic, one failure can ripple across the system.
Engineers describe these frameworks the same way: “great for demos, brittle for workflows that can’t break.”
In finance, shipping fast only matters if what you ship keeps working. Speed without redundancy, observability, and control isn’t a moat—it’s momentum without endurance.
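Redundancy can be sketched in a few lines. The two provider calls below are placeholders rather than any real SDK (call_primary and call_fallback are assumptions); the pattern is retry with backoff, then route around the outage.

    import time

    def call_primary(prompt):
        raise ConnectionError("primary provider is rate-limited")

    def call_fallback(prompt):
        return "ok: " + prompt

    def complete(prompt, retries=2):
        for attempt in range(retries):
            try:
                return call_primary(prompt)
            except ConnectionError:
                time.sleep(2 ** attempt)  # brief backoff before retrying
        return call_fallback(prompt)      # degrade to a second provider

    print(complete("code this invoice to GL 6010"))

The design choice is that a provider outage degrades a single call instead of halting every workflow built on the same stack.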
Why scale doesn’t equal intelligence
Across AI, scale is often treated as proof of progress: more data, more users, more intelligence.
In finance, that logic breaks.
Financial data doesn’t generalise. Every organisation has its own chart of accounts, approval hierarchies, and ERP configurations. Rules governing spend are local, specific, and legally binding. A pattern that works in one company can be a violation in another.
The smarter approach isn’t collecting more data; it’s respecting boundaries. Intelligence in finance comes from context: interpreting policy correctly, respecting permissions, and explaining every action taken. Approval thresholds, budget rules, and accounting structures must be sources of truth, not training material.
This progress is slower by design. A system that moves money must be auditable before it can be impressive. What matters isn’t how much data it sees, but whether every decision can be traced, explained, and reversed.
Scale creates convenience. Control creates trust. In finance, trust is the only metric that compounds.
Policy-bounded autonomy
If AgentKit made agents easier to build, the next frontier is making them behave.
The agents that matter in finance won’t just follow scripts. They’ll reason within boundaries. They’ll know what they can decide, what requires confirmation, and when to stop and ask for help.
This is higher-freedom, policy-bounded orchestration: agents that plan their own routes while staying inside governance rails.
These systems can handle non-linear workflows, switch tools when performance drops, keep state so retries don’t duplicate work, and explain themselves as they go. When something falls outside their remit, they escalate with context instead of leaving humans to clean up.
It’s autonomy inside the fence, with accountability at every step.
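What that fence looks like can be sketched in a few lines of Python. The policy table, thresholds, and field names below are illustrative assumptions, not any vendor’s implementation.

    POLICY = {"auto_approve_limit": 500, "receipt_required_over": 75}
    processed = set()                     # idempotency keys already handled

    def decide(expense):
        if expense["id"] in processed:    # retries must not duplicate work
            return "skipped: already processed"
        if expense["amount"] > POLICY["auto_approve_limit"]:
            # outside the agent's remit: escalate with context, don't act
            return "escalated: amount exceeds the auto-approve limit"
        if (expense["amount"] > POLICY["receipt_required_over"]
                and not expense.get("receipt")):
            return "paused: receipt required before approval"
        processed.add(expense["id"])
        return "approved within policy"

    print(decide({"id": "exp-1", "amount": 120, "receipt": True}))  # approved
    print(decide({"id": "exp-2", "amount": 9000}))                  # escalated

The policy here is a source of truth the agent reads at decision time, not training material it approximates.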
The trust layer: behavioural evaluation
Autonomy only works if you can prove behaviour.
Most AI metrics, including accuracy, latency, and benchmark scores, measure performance in isolation. They don’t capture what happens when workflows fail halfway through or when policy boundaries are tested. In finance, those moments matter most.
Before an agent handles company money, it should answer four questions (sketched in code below):
1. Did it choose the right tool?
2. When something failed, did it recover correctly?
3. Did it stay within policy boundaries?
4. When it needed help, did it escalate with sufficient context?
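These questions only matter once they are testable. Below is a minimal sketch of behavioural checks run against an agent trace; the trace format and field names are assumptions for illustration, since no standard exists yet.

    trace = {
        "tool_calls": ["fetch_invoice", "post_to_erp"],
        "failures": [{"step": "post_to_erp", "retried": True, "duplicated": False}],
        "policy": {"amount": 300, "limit": 500},
        "escalation": {"raised": False, "context": None},
    }

    def evaluate(t):
        return {
            "right_tool": t["tool_calls"][0] == "fetch_invoice",            # Q1
            "clean_recovery": all(f["retried"] and not f["duplicated"]
                                  for f in t["failures"]),                  # Q2
            "within_policy": t["policy"]["amount"] <= t["policy"]["limit"], # Q3
            "good_escalation": not t["escalation"]["raised"]
                               or bool(t["escalation"]["context"]),         # Q4
        }

    print(evaluate(trace))  # every check must pass before the agent ships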
Accuracy is easy to publish. Reliability has to be earned. As one CFO told us: “I don’t care if it answers fast. I care that it never acts twice without permission.”
Behavioural evaluation isn’t a product yet. It’s a standard—and a requirement for the next phase of automation.
The next frontier: trust
Every technology wave starts with speed. But once prototypes become infrastructure, the question changes: from “how fast can we build it?” to “how sure can we be that it works?”
Agentic AI has solved the build problem: anyone can connect models, data, and APIs into something that looks intelligent. The harder work now is proving these systems behave consistently when the stakes are high.
Finance is the stress test. It measures technology by traceability, accuracy, and accountability under pressure. The next phase of agentic AI won’t be defined by bigger models or faster canvases. It will be defined by trust. Because innovation moves markets, but trust builds them.
Hristo Borisov is CEO at UK-based unicorn Payhawk
