AI agents save hours on work that used to require careful, tedious effort. The agent does 80% of the job well, and the last 20% requires someone who knows what “right” looks like.

That remaining 20% is the interesting part.

What 80% Looks Like in Practice

On our project, we have skills that handle many parts of the development process. Write the user stories. Design the API endpoints. Stub the endpoints. Generate the end-to-end tests. Each of these, most of the time, produces something useful.

But “useful” isn’t the same as “correct.”

The story agent writes stories in our format. When I read what it produces, it’s close to what I would have written. Not always identical, not always exactly right, but close enough that a few corrections get it there. The endpoint design agent does a similar job. Most of the contract comes out well. Occasionally, it generates something that I look at and think: “No, that’s not how we do that here.” I correct it. The agent doesn’t learn in real time, but I document the correction and refine the skill.

The test-writing agent is less reliable. It has a tendency to circle back on itself, repeating work it already did or missing something I would have caught in a review. I haven’t fully refined that one yet, because I haven’t gone back and built the correction loops into the skill. It’s still on my list.

The Correction Is the Valuable Part

The value I bring isn’t in doing the work the agent does. It’s in knowing where the agent is wrong.

You can’t rely on AI agents to do autonomous work in a domain you don’t understand well. The agent will produce something plausible, possibly even something that looks correct, and if you don’t know enough to evaluate it, you won’t catch the errors. The 80% will look like 100%.

This is why I tell new team members: you still have to know all the parts. Not because the agent won’t do most of them, but because when it doesn’t, you need to recognize what’s missing. You need to be able to look at a generated handler and say “that’s not right, we don’t use the synchronous version there.” You can only do that if you’ve written enough handlers to know the difference.

Why the Handoffs Are Hard

The other problem is that agents don’t hand off well to each other. On our project, we have skills for the individual steps, but the transitions between them still require human judgment.

At each transition, someone has to make a call:

  • The story agent finishes. Is the output good enough to move to endpoint design?

  • The endpoint design comes back. Does it need a review, or can we stub it directly?

  • The stubs are done. Can we split the work now, or is there a dependency to resolve first?

[Figure: handoffs.png]

None of those decisions is hard in isolation. But they require someone to look at what came out and say: yes, we can proceed. Or: no, this needs another pass first. That’s the handoff. It’s lightweight, but it can’t be skipped.

I’ve been experimenting with having an orchestrating agent manage those handoffs: giving it the full plan up front and letting it direct the sequence. It works reasonably well for planning. But executing the plan, especially with sub-agents running in parallel, requires the individual skills to be reliable enough that the orchestrator doesn’t have to constantly course-correct. We’re not fully there yet.

Stage Four Requires More Than Bulletproof Agents

In our AI maturity model at Improving, Stage Four is where those individual task agents stop being a collection of separate tools and become a coordinated workflow managed by an orchestrating agent. The orchestrator doesn’t do the work itself. It calls each task agent in sequence, passes the output of one as the input of the next, and manages the overall state of the process.
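To make that shape concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not how our skills are actually implemented: the `Orchestrator` class, the `TaskAgent` type, and the four stand-in agent functions are all hypothetical names, and each "agent" is just a function that wraps its input so the flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: a task agent is modeled as a plain function from one
# artifact (a string here) to the next. Real agents would call a model.
TaskAgent = Callable[[str], str]


@dataclass
class Orchestrator:
    """Calls each task agent in order; does none of the work itself."""
    steps: list[tuple[str, TaskAgent]]
    history: dict[str, str] = field(default_factory=dict)  # overall state

    def run(self, initial_input: str) -> str:
        artifact = initial_input
        for name, agent in self.steps:
            # The output of one step becomes the input of the next.
            artifact = agent(artifact)
            self.history[name] = artifact
        return artifact


# Hypothetical stand-ins for the story, design, stub, and test agents.
def write_stories(feature: str) -> str:
    return f"stories({feature})"

def design_endpoints(stories: str) -> str:
    return f"endpoints({stories})"

def stub_endpoints(design: str) -> str:
    return f"stubs({design})"

def generate_tests(stubs: str) -> str:
    return f"tests({stubs})"


orchestrator = Orchestrator(steps=[
    ("stories", write_stories),
    ("design", design_endpoints),
    ("stubs", stub_endpoints),
    ("tests", generate_tests),
])
result = orchestrator.run("checkout")
print(result)  # tests(stubs(endpoints(stories(checkout))))
```

The `history` dictionary is the simplest possible version of "managing the overall state": it records what each step produced, so a person checking in later can see where a problem first appeared.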

This project is well-positioned to get there. We have agents for many of the steps. We have patterns that have been refined over many sprints. The infrastructure is solid.

What we’re missing is the arrows.

If the individual agents are the boxes, the arrows are the connections between them. Each one represents three things: a defined handoff (what data or artifact transfers from one agent to the next), a checkpoint (a deliberate pause for human review at high-risk transitions, like moving from draft code to opening a pull request), and an error boundary. When an upstream agent produces something the downstream agent can’t use, what happens? Retry? Escalate? Skip the step entirely?
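One way to picture an arrow is as a small piece of code with those three parts spelled out. The sketch below is hypothetical throughout: `Arrow`, `OnError`, `Escalation`, and `cross` are names invented for illustration, and the usability check is a trivial placeholder for whatever contract a real handoff would enforce.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class OnError(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    SKIP = "skip"


class Escalation(Exception):
    """Raised when a transition needs a human to step in."""


@dataclass
class Arrow:
    name: str
    is_usable: Callable[[str], bool]      # handoff: can downstream use this?
    needs_review: bool = False            # checkpoint: pause for a human?
    on_error: OnError = OnError.ESCALATE  # error boundary policy
    max_retries: int = 2


def cross(arrow: Arrow,
          upstream: Callable[[], str],
          approve: Callable[[str], bool]) -> Optional[str]:
    """Return the artifact to hand downstream, None to skip, or escalate."""
    attempts = arrow.max_retries + 1 if arrow.on_error is OnError.RETRY else 1
    for _ in range(attempts):
        artifact = upstream()
        if arrow.is_usable(artifact):
            break
    else:
        # No attempt produced something the downstream agent can use.
        if arrow.on_error is OnError.SKIP:
            return None
        raise Escalation(f"{arrow.name}: upstream output is not usable")
    if arrow.needs_review and not approve(artifact):
        raise Escalation(f"{arrow.name}: reviewer asked for another pass")
    return artifact


# Usage: the first draft is empty, the retry produces a usable one.
drafts = iter(["", "GET /orders/{id}"])
arrow = Arrow("design -> stub", is_usable=lambda a: bool(a.strip()),
              needs_review=True, on_error=OnError.RETRY)
handed_off = cross(arrow, upstream=lambda: next(drafts), approve=lambda a: True)
print(handed_off)  # GET /orders/{id}
```

In this sketch a retry that runs out of attempts falls through to escalation, which is a design choice rather than a given: the point is only that each arrow states its policy explicitly instead of leaving it to whoever happens to be watching.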

Stage Four also changes your role. You’re no longer reviewing every individual task output. You review the final workflow result, or check in at the predefined connection points. But you don’t disappear. Those checkpoints exist precisely because not every transition can be trusted without a human in the loop.

Stage Four agents also have to be shared. Stage Three skills can be personal, shaped around how you work, your preferences, your shortcuts. Once you wire them into an orchestrated workflow the whole team depends on, they have to reflect shared quality gates, a shared definition of done, and shared decisions about where approvals are required. Skipping that conversation produces a workflow nobody fully trusts.

Right now, we don’t have confidence that every agent will consistently produce something the next one can build on. Until the handoffs are solid, problems compound. The endpoint design has a subtle mistake, the stub gets built on that mistake, the frontend gets built on the stub, and by the time anyone notices, untangling it is a real cost.

Getting to Stage Four means closing the gaps where the agents reliably fail, and encoding the connections between them with the same care as the agents themselves.

What I’m Paying Attention To

Every time an agent produces something I have to correct, I try to capture the pattern. What was the rule it missed? Is that rule documented somewhere the agent can find it? Is there a hook I can add to verify the output before we proceed?

Some of those corrections are small: a naming convention, an async method preference, a specific pattern for how we express validation conditions. But they accumulate. A system where every individual agent is 95% reliable gets unreliable fast when you chain several of them together. The compounding matters.
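The arithmetic behind that is short enough to check directly. This sketch assumes each agent's failures are independent of the others, which is a simplification (in practice an upstream mistake often makes downstream mistakes more likely), and `chain_reliability` is a hypothetical helper name.

```python
def chain_reliability(per_agent: float, agents: int) -> float:
    """Chance that every agent in a chain gets its part right,
    assuming independent failures (a simplifying assumption)."""
    return per_agent ** agents


for n in (1, 2, 4, 8):
    print(f"{n} agents at 95%: {chain_reliability(0.95, n):.0%}")
# 1 agents at 95%: 95%
# 2 agents at 95%: 90%
# 4 agents at 95%: 81%
# 8 agents at 95%: 66%
```

Under that assumption, a four-step chain of 95% agents comes out fully right only about four times in five, and an eight-step chain about two times in three.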

The path to autonomous execution runs through consistency, not capability. Getting agents to do their part reliably enough that the handoffs can be trusted is slower work than adding new features. But it’s what actually gets you to Stage Four.
