This blog post is based on a conversation Test Double Principal Consultant David Mosher had with Traingent Co-Founder and CTO Wayne Marley, and the video is an excerpt from that conversation.
Humans in the lead, not just in the loop
The industry has settled on a phrase for how people and AI should work together: humans in the loop. It sounds responsible. It also quietly implies the AI is driving and the human is there to catch mistakes.
That framing has it backward.
I was at an AI conference in San Francisco recently and heard a leader flip it: we don't want humans in the loop. We want humans in the lead. That reframe is the whole argument of this post. Coding agents like Claude and Codex are genuinely capable. They are also, in Wayne Marley's words from when we were working together, extremely convincing. They produce plans that sound right. Code that compiles. Explanations that read as confident.
None of that is the same as being correct.
You can't outsource the thinking. If you try, the agents drift. The drift compounds. And the gap between "our demo works" and "we can operate this in production" becomes the place where most AI products quietly die.
Staying in the lead at the pace AI makes possible requires something bigger than good prompts or a careful reviewer. It requires a system—a harness—that encodes judgment into the repo itself, so the right things are easy and the wrong things are hard. That harness is what AI enablement actually means. Everything else is tooling, and tooling changes every six months.
The progression from conventions to enforcement
When I first started working on Traingent's Ride with Dave product, I had a specific mental model for what made a codebase agent-friendly. Opinionated folder structure. Scaffolded examples of every major component. If we had a worker process, there was an example of what a worker looked like. If we had a backend/frontend split, there were exemplars of both.
It was a reasonable start. It was also not enough.
The codebase needed something more deterministic than "here's an example, try to match it." When Jed Schneider joined the engagement, he pushed it forward by adding ESLint rules that enforced the package boundaries we had only been documenting. An import crossing a boundary it shouldn't cross now failed the lint check immediately—which meant the agent saw the error and self-corrected, without a human reviewer in the path.
That shift—from convention to enforcement—is the single highest-leverage move I can point to in this kind of work. Conventions live in documentation. Rules live in the CI pipeline. Agents respect rules more reliably than they respect prose. So do new engineers.
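To make the idea of an enforced boundary concrete, here is a minimal sketch of the kind of check a lint rule performs. It is not the rule set Jed wrote, and the post does not say which ESLint rules were used (ESLint's core `no-restricted-imports` rule is one common way to express this). The package names, the `packages/<name>/` layout, and the `@app/<name>` import prefix are all assumptions made up for illustration.

```typescript
// Hypothetical package layout; these names are illustrative, not Traingent's.
type PackageName = "web" | "api" | "worker" | "shared";

// Each package lists the packages it is allowed to import from.
const allowedImports: Record<PackageName, PackageName[]> = {
  web: ["shared"],
  api: ["shared"],
  worker: ["shared"],
  shared: [],
};

interface ImportEdge {
  file: string; // importing file, e.g. "packages/web/src/app.ts"
  specifier: string; // imported module, e.g. "@app/api/db"
}

interface Violation {
  file: string;
  from: PackageName;
  to: PackageName;
  message: string;
}

const packageNames = Object.keys(allowedImports) as PackageName[];

function asPackageName(name: string | undefined): PackageName | undefined {
  return packageNames.filter((p) => p === name)[0];
}

// Assumes files live under packages/<name>/...
function owningPackage(file: string): PackageName | undefined {
  const match = /^packages\/([^/]+)\//.exec(file);
  return asPackageName(match ? match[1] : undefined);
}

// Assumes cross-package imports look like "@app/<name>" or "@app/<name>/...".
function targetPackage(specifier: string): PackageName | undefined {
  const match = /^@app\/([^/]+)/.exec(specifier);
  return asPackageName(match ? match[1] : undefined);
}

function findBoundaryViolations(edges: ImportEdge[]): Violation[] {
  const violations: Violation[] = [];
  for (const edge of edges) {
    const from = owningPackage(edge.file);
    const to = targetPackage(edge.specifier);
    // Third-party, relative, and same-package imports are out of scope here.
    if (!from || !to || from === to) continue;
    if (allowedImports[from].indexOf(to) === -1) {
      violations.push({
        file: edge.file,
        from,
        to,
        message: `"${from}" may not import from "${to}"`,
      });
    }
  }
  return violations;
}

// Usage: the second import crosses a boundary and is reported.
const boundaryReport = findBoundaryViolations([
  { file: "packages/web/src/app.ts", specifier: "@app/shared/format" },
  { file: "packages/web/src/app.ts", specifier: "@app/api/db" },
]);
for (const v of boundaryReport) {
  console.log(`${v.file}: ${v.message}`);
}
```

The design point is that the allowed-dependency table is data in the repo, so the failure message reaches the agent as a deterministic error instead of a reviewer's comment.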
From there, the harness compounded. An ADR skill formalized how architecture decisions got recorded. A PR review skill encoded the architectural boundaries, security posture, and design intent that a senior reviewer would have carried in their head. Each piece captured something invisible and made it enforceable.
Wayne described what that meant from inside Traingent:
"There can be some perception that having overly verbose linting rules or building in these PR review specifics can be constrictive, but I haven't seen that. For us it's unlocked our ability to move quickly because we're not stumbling through rule violations or things that are incompatible with the scaffolding or the technical mindset that we've put together."
The fear with harness work is always the same: won't all these rules slow us down? In practice the opposite is true. The rules absorb the small frictions that used to eat developer attention—the Slack thread about where a file should go, the PR review that catches a boundary violation after the fact, the onboarding call explaining how we do things. When that friction disappears, attention redirects to the work that actually requires judgment.
Constraints don't slow good teams down. They free them up.
The invisible becomes enforceable
Software quality used to live in people's heads. If you wanted consistency across a team, you hired carefully, onboarded slowly, and hoped culture absorbed newcomers before bad patterns absorbed them. Reviews caught what they could. Documentation tried to fill the gaps. Neither fully worked. The knowledge was trapped in human context and evaporated the moment anyone left.
When you encode that knowledge—lint rules, ADR skills, review prompts, architectural constraints—something changes. Consistency stops being a property of who's around that day. It becomes a property of the system. New engineers onboard faster because the rules teach them. Agents produce code that fits because the rules constrain them. Reviews catch more because the review logic is itself encoded.
This is the part most teams underinvest in, because it looks like overhead. It isn't. It's the thing that makes everything downstream possible.
Observability is the other half of the harness
Architecture keeps the code honest. Observability keeps the running system honest. And with LLM-based products, this is where I see teams underinvest the most.
You cannot operate a competent LLM-based product without tracing, spans, and evaluations. The systems are non-deterministic by nature. When something looks wrong in the output, the root cause is almost never where the symptom appears. Maybe an instruction got misinterpreted three turns back. Maybe a date format got mishandled. Maybe a tool call returned partial data and the model confabulated its way through. Without observability, you are guessing.
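As a rough illustration of what traces and spans buy you, here is a hand-rolled sketch. It is not the Langfuse SDK or any real tracing library, and every name in it (`Trace`, `span`, `isPartial`, the step names, the sample data) is hypothetical. It is synchronous for brevity; real model and tool calls would be asynchronous.

```typescript
// Hand-rolled tracer for illustration only; this is not a real tracing SDK.
interface Span {
  traceId: string;
  name: string; // e.g. "parse-date", "tool:fetch-rides", "llm:generate"
  parent: string | null; // name of the enclosing span, if any
  input: unknown;
  output?: unknown;
  error?: string;
  startedAt: number;
  endedAt?: number;
}

class Trace {
  readonly traceId: string;
  readonly spans: Span[] = [];
  private stack: string[] = [];

  constructor(traceId: string) {
    this.traceId = traceId;
  }

  // Run one step inside a span, recording its input, output, and any error.
  span<T>(name: string, input: unknown, step: () => T): T {
    const parent = this.stack.length > 0 ? this.stack[this.stack.length - 1] : null;
    const span: Span = { traceId: this.traceId, name, parent, input, startedAt: Date.now() };
    this.spans.push(span);
    this.stack.push(name);
    try {
      const output = step();
      span.output = output;
      return output;
    } catch (err) {
      span.error = err instanceof Error ? err.message : String(err);
      throw err;
    } finally {
      span.endedAt = Date.now();
      this.stack.pop();
    }
  }
}

// True when a step reported that it returned incomplete data.
function isPartial(output: unknown): boolean {
  return (
    typeof output === "object" &&
    output !== null &&
    (output as { partial?: unknown }).partial === true
  );
}

// Usage: one conversational turn with made-up steps and data.
const turnTrace = new Trace("turn-001");
turnTrace.span("handle-turn", { message: "Plan my ride for 03/04" }, () => {
  const date = turnTrace.span("parse-date", "03/04", () => "2025-03-04");
  const rides = turnTrace.span("tool:fetch-rides", { date }, () => ({
    rides: [] as string[],
    partial: true,
  }));
  return turnTrace.span("llm:generate", { date, rides }, () => "Here is a plan.");
});

// When the final answer looks wrong, walk back through the recorded spans.
const suspectSpans = turnTrace.spans.filter(
  (s) => s.error !== undefined || isPartial(s.output)
);
console.log(suspectSpans.map((s) => s.name));
```

With the spans recorded, the question "where did this answer go wrong?" becomes a query over data you already have, not a guess.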
Traingent implemented Langfuse for end-to-end tracing and added sentiment analysis as an asynchronous evaluation step. The sentiment work mattered for them specifically because cycling advice is hard to score against a rubric. A response can sound right—mentions cycling, talks about heart rate, uses plausible physiological concepts—and still be wrong because it applied road-cycling logic to someone on a mountain bike. The user will know. They'll respond with some version of “What are you talking about?” The sentiment eval catches that when a spec-based eval wouldn't have fired.
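A sentiment evaluation of this kind might be sketched roughly as follows. This is not Traingent's implementation and does not use the Langfuse API. The phrase lists are placeholders (a production evaluator would more likely use a model-based classifier), and `EvalQueue`, `scoreFollowUp`, and the sample messages are hypothetical names and data. "Asynchronous" here just means the scoring happens off the request path, when a background job drains the queue.

```typescript
interface Turn {
  traceId: string; // links the score back to the trace for that turn
  assistantReply: string;
  userFollowUp: string; // the user's next message
}

interface SentimentScore {
  traceId: string;
  score: number; // -1 (confused or frustrated) to 1 (satisfied), 0 if no signal
  matched: string[];
}

// Placeholder phrase lists, invented for this sketch.
const negativeSignals = ["what are you talking about", "makes no sense", "that is wrong"];
const positiveSignals = ["thanks", "that helps", "perfect"];

function matches(text: string, phrases: string[]): string[] {
  const lower = text.toLowerCase();
  return phrases.filter((p) => lower.indexOf(p) !== -1);
}

// Score the user's reaction to a reply, not the reply itself.
function scoreFollowUp(turn: Turn): SentimentScore {
  const neg = matches(turn.userFollowUp, negativeSignals);
  const pos = matches(turn.userFollowUp, positiveSignals);
  const total = neg.length + pos.length;
  const score = total === 0 ? 0 : (pos.length - neg.length) / total;
  return { traceId: turn.traceId, score, matched: neg.concat(pos) };
}

class EvalQueue {
  private pending: Turn[] = [];

  // Called on the request path; cheap, does no scoring.
  enqueue(turn: Turn): void {
    this.pending.push(turn);
  }

  // Called later, for example by a background worker on a timer.
  drain(): SentimentScore[] {
    const batch = this.pending;
    this.pending = [];
    return batch.map((turn) => scoreFollowUp(turn));
  }
}

// Usage with made-up messages.
const evalQueue = new EvalQueue();
evalQueue.enqueue({
  traceId: "turn-001",
  assistantReply: "Hold a steady cadence in the drops on the descent.",
  userFollowUp: "What are you talking about? I ride a mountain bike.",
});
const flagged = evalQueue.drain().filter((s) => s.score < 0);
console.log(flagged.map((s) => s.traceId));
```

Keying each score by trace ID is the part that matters: a negative score points you at the exact trace to open.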
Wayne's take on this is helpful:
"You honestly can't have an LLM product that you're creating for a customer without observability. It's non-deterministic by nature. If you don't have visibility to the traces and the spans, you're completely guessing."
Treat observability as day-zero infrastructure. Not day one. Not "when we have users." Day zero. The traces you capture in development are the traces that let you debug production. The evaluations you build early are the ones that catch regressions when the model changes underneath you.
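One way to picture evaluations catching a regression when the model changes is the sketch below. It assumes a fixed set of eval cases and compares pass rates between a baseline and a candidate model. The models are stub functions and the cases are invented; `compareModels`, `passRate`, and the `tolerance` parameter are hypothetical names, not part of any tool mentioned in this post.

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  // Returns true when the reply is acceptable for this case.
  passes: (reply: string) => boolean;
}

// A model is reduced to a plain function here; real calls would be async.
type Model = (prompt: string) => string;

function passRate(model: Model, cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => c.passes(model(c.prompt))).length;
  return passed / cases.length;
}

interface RegressionReport {
  baseline: number;
  candidate: number;
  regressed: boolean;
}

// Flags a regression when the candidate's pass rate drops below the
// baseline's by more than the tolerance.
function compareModels(
  baseline: Model,
  candidate: Model,
  cases: EvalCase[],
  tolerance = 0
): RegressionReport {
  const b = passRate(baseline, cases);
  const c = passRate(candidate, cases);
  return { baseline: b, candidate: c, regressed: c < b - tolerance };
}

function mentions(reply: string, word: string): boolean {
  return reply.toLowerCase().indexOf(word) !== -1;
}

// Invented cases: the reply should match the kind of riding the user described.
const evalCases: EvalCase[] = [
  {
    id: "mtb-climb",
    prompt: "I ride a mountain bike. How should I pace climbs?",
    passes: (reply) => mentions(reply, "mountain"),
  },
  {
    id: "road-climb",
    prompt: "I ride a road bike. How should I pace climbs?",
    passes: (reply) => mentions(reply, "road"),
  },
];

// Stub models standing in for two model versions.
const baselineModel: Model = (prompt) =>
  mentions(prompt, "mountain")
    ? "On a mountain bike, stay seated and keep traction."
    : "On the road, hold steady power.";
const candidateModel: Model = () => "On the road, hold steady power.";

const modelReport = compareModels(baselineModel, candidateModel, evalCases);
console.log(modelReport);
```

Run against the same cases on every model or prompt change, a check like this turns "the answers feel worse" into a number you can gate a release on.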
The dual benefit
The easy case for investing in harness work is that it makes agents more effective. That case is real and it's sufficient on its own. But there's a second effect that is harder to anticipate and matters just as much: the same investments make the next human dramatically more effective too.
When Traingent brought on a new engineer named Patrick, Wayne sat with Claude, walked through the architecture docs, the ADRs, and the roadmap, and generated a concrete two-to-three-week plan matching Patrick's skills against Traingent's actual gaps. That planning exercise would have taken days in a less-instrumented environment. Or it would have happened loosely, in someone's head, and produced a less honest estimate. The system that makes the agent productive is the system that makes onboarding fast.
Hold onto that. The investments look like they're for the agents. They're really for the team.
The speed trap
There's a version of this conversation that sounds like it's about moving fast, and that's not quite right.
The teams I see struggling right now aren't struggling because they can't ship. They're struggling because they can ship too easily, and the output compounds in the wrong direction. A prototype gets built in a weekend. Another gets grafted on. Features accumulate faster than tests. Observability becomes an afterthought. Within months the codebase is a liability and the team is afraid to touch it.
Rikky Singh, Traingent's CEO, describes it this way:
"We ship so fast that tickets become history books and not planning tools."
That only works because the harness is doing the careful work that makes the fast work safe. You can spin up a prototype in an afternoon now. Industrializing one still takes discipline. The difference between teams that scale and teams that stall is almost always whether they invested in that discipline before they thought they needed it.
What to take from this
Two things have held up across every engagement I've worked on in this space.
Invest in the harness, not the tools. What matters is whether your codebase makes the right things easy and the wrong things hard. Lint rules, ADR practices, review skills, scaffolding conventions, and observability will outlast any particular tool generation. The tools will change, so the harness is the real asset.
Design for humans in the lead. The entire point of the harness is to keep people guiding the work, with the system catching drift when it happens. If your setup assumes the AI is driving and a reviewer will catch the problems, you've already lost. Being intentional about your setup and harness limits drift and helps keep the AI on the rails.
Dave Mosher is a Principal Consultant at Test Double, and has experience in legacy modernization, agentic coding, and explaining CORS poorly to people who didn't ask.
Wayne Marley is Co-founder and CTO at Traingent and a runner and cyclist. He’s building intelligent assistants with purpose, personality, and persistence.