Developers
AI

What it takes to keep humans in the lead with AI

Teams actually shipping AI products build a system that keeps agents honest and humans in charge.
Dave Mosher
|
June 9, 2026

This post is based on a conversation between Test Double Principal Consultant David Mosher and Traingent Co-Founder and CTO Wayne Marley. The video is an excerpt from that conversation.

Humans in the lead, not just in the loop

The industry has settled on a phrase for how people and AI should work together: humans in the loop. It sounds responsible. It also quietly implies the AI is driving and the human is there to catch mistakes.

That framing has it backward.

I was at an AI conference in San Francisco recently and heard a leader flip it: we don't want humans in the loop. We want humans in the lead. That reframe is the whole argument of this post. Coding agents like Claude and Codex are genuinely capable. They are also, in Wayne Marley's words from when we were working together, extremely convincing. They produce plans that sound right. Code that compiles. Explanations that read as confident.

None of that is the same as being correct.

You can't outsource the thinking. If you try, the agents drift. The drift compounds. And the gap between "our demo works" and "we can operate this in production" becomes the place where most AI products quietly die.

Staying in the lead at the pace AI makes possible requires something bigger than good prompts or a careful reviewer. It requires a system—a harness—that encodes judgment into the repo itself, so the right things are easy and the wrong things are hard. That harness is what AI enablement actually means. Everything else is tooling, and tooling changes every six months.

The progression from conventions to enforcement

When I first started working on Traingent's Ride with Dave product, I had a specific mental model for what made a codebase agent-friendly. Opinionated folder structure. Scaffolded examples of every major component. If we had a worker process, there was an example of what a worker looked like. If we had a backend/frontend split, there were exemplars of both.

It was a reasonable start, but it was not enough.

The codebase needed something more deterministic than "here's an example, try to match it." When Jed Schneider joined the engagement, he pushed it forward by adding ESLint rules that enforced the package boundaries we had only been documenting. An import crossing a boundary it shouldn't cross now failed the lint check immediately—which meant the agent saw the error and self-corrected, without a human reviewer in the path.

That shift—from convention to enforcement—is the single highest-leverage move I can point to in this kind of work. Conventions live in documentation. Rules live in the CI pipeline. Agents respect rules more reliably than they respect prose. So do new engineers.
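To make the idea of an enforced boundary concrete, here is a minimal sketch of the kind of check such a lint rule performs. It is not the ESLint configuration from the Traingent engagement, which this post does not show. The package names, the `packages/<name>/...` layout, and the helpers `packageOf` and `findBoundaryViolations` are all assumptions made for illustration. In a real ESLint setup, a constraint like this is often expressed with a rule such as `no-restricted-imports` rather than hand-written code.

```typescript
// Hypothetical package boundaries: each package lists the packages it may import.
// These names are illustrative, not Traingent's real layout.
type BoundaryRules = Record<string, readonly string[]>;

const boundaryRules: BoundaryRules = {
  frontend: ["shared"],
  backend: ["shared"],
  worker: ["shared", "backend"],
  shared: [],
};

interface ImportEdge {
  file: string; // path of the importing file
  imported: string; // path of the module being imported
}

// Extracts the package name from a "packages/<name>/..." path,
// or returns null when the path is outside that assumed layout.
function packageOf(path: string): string | null {
  const match = /^packages\/([^/]+)\//.exec(path);
  return match !== null && match[1] !== undefined ? match[1] : null;
}

// Returns one message per import that crosses a boundary the rules do not allow.
function findBoundaryViolations(edges: ImportEdge[], rules: BoundaryRules): string[] {
  const violations: string[] = [];
  for (const edge of edges) {
    const from = packageOf(edge.file);
    const to = packageOf(edge.imported);
    // Imports inside one package, or outside the known layout, are not checked.
    if (from === null || to === null || from === to) continue;
    const allowed = rules[from] ?? [];
    if (allowed.indexOf(to) === -1) {
      violations.push(`${edge.file}: "${from}" may not import from "${to}"`);
    }
  }
  return violations;
}

// Usage: the second import crosses from frontend into backend and is reported.
const sampleEdges: ImportEdge[] = [
  { file: "packages/frontend/src/app.ts", imported: "packages/shared/src/types.ts" },
  { file: "packages/frontend/src/app.ts", imported: "packages/backend/src/db.ts" },
];
for (const message of findBoundaryViolations(sampleEdges, boundaryRules)) {
  console.log(message);
}
```

The point of the sketch is the feedback shape: a violation is a deterministic, machine-readable failure tied to a file, which is something an agent can act on without waiting for a reviewer.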

From there, the harness compounded. An ADR skill formalized how architecture decisions got recorded. A PR review skill encoded the architectural boundaries, security posture, and design intent that a senior reviewer would have carried in their head. Each piece captured something invisible and made it enforceable.

Wayne described what that meant from inside Traingent:

"There can be some perception that having overly verbose linting rules or building in these PR review specifics can be constrictive, but I haven't seen that. For us it's unlocked our ability to move quickly because we're not stumbling through rule violations or things that are incompatible with the scaffolding or the technical mindset that we've put together."

The fear with harness work is always the same: won't all these rules slow us down? In practice the opposite is true. The rules absorb the small frictions that used to eat developer attention—the Slack thread about where a file should go, the PR review that catches a boundary violation after the fact, the onboarding call explaining how we do things. When that friction disappears, attention redirects to the work that actually requires judgment.

Constraints don't slow good teams down. They free them up.

The invisible becomes enforceable

Software quality used to live in people's heads. If you wanted consistency across a team, you hired carefully, onboarded slowly, and hoped culture absorbed newcomers before bad patterns absorbed them. Reviews caught what they could. Documentation tried to fill the gaps. Neither fully worked. The knowledge was trapped in human context and evaporated the moment anyone left.

When you encode that knowledge—lint rules, ADR skills, review prompts, architectural constraints—something changes. Consistency stops being a property of who's around that day. It becomes a property of the system. New engineers onboard faster because the rules teach them. Agents produce code that fits because the rules constrain them. Reviews catch more because the review logic is itself encoded.

This is the part most teams underinvest in, because it looks like overhead. It isn't. It's the thing that makes everything downstream possible.

Observability is the other half of the harness

Architecture keeps the code honest. Observability keeps the running system honest. And with LLM-based products, this is where I see teams underinvest the most.

You cannot operate a competent LLM-based product without tracing, spans, and evaluations. The systems are non-deterministic by nature. When something looks wrong in the output, the root cause is almost never where the symptom appears. Maybe an instruction got misinterpreted three turns back. Maybe a date format got mishandled. Maybe a tool call returned partial data and the model confabulated its way through. Without observability, you are guessing.
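As a rough illustration of why spans help locate a root cause, here is a library-agnostic sketch of a trace recorded as a tree of spans. It does not use or mirror any tracing SDK. The `Span` fields, the `Trace` class, and the sample conversation are hypothetical and exist only to show the shape of the data.

```typescript
// One recorded step in a conversation: a turn, a tool call, a model call.
// Field names are illustrative and do not mirror any particular tracing SDK.
interface Span {
  id: number;
  parentId: number | null;
  label: string;
  input: unknown;
  output: unknown;
  ok: boolean; // false when the step returned partial or invalid data
}

class Trace {
  private spans: Span[] = [];

  // Records one step and returns its id so child steps can point at it.
  // Ids are 1-based positions in the recording order.
  record(label: string, input: unknown, output: unknown, ok: boolean, parentId: number | null = null): number {
    const id = this.spans.length + 1;
    this.spans.push({ id, parentId, label, input, output, ok });
    return id;
  }

  // Returns the spans flagged as not ok, in the order they were recorded.
  failures(): Span[] {
    return this.spans.filter((span) => !span.ok);
  }

  // Walks parent links from a span up to the root and returns the labels root-first.
  pathTo(spanId: number): string[] {
    const labels: string[] = [];
    let currentId: number | null = spanId;
    while (currentId !== null) {
      const span: Span | undefined = this.spans[currentId - 1];
      if (span === undefined) break;
      labels.unshift(span.label);
      currentId = span.parentId;
    }
    return labels;
  }
}

// Usage: the bad advice shows up in turn 3, but the flagged span is a tool call in turn 2.
const rideTrace = new Trace();
const conversationId = rideTrace.record("conversation", null, null, true);
rideTrace.record("turn-1", "Plan my week", "Which bike do you ride?", true, conversationId);
const turnTwoId = rideTrace.record("turn-2", "Mountain bike", "Got it.", true, conversationId);
const toolSpanId = rideTrace.record("tool:rideHistory", { days: 30 }, { rides: [], truncated: true }, false, turnTwoId);
rideTrace.record("turn-3", "So what should I do?", "Ride 80 km on the road Saturday.", true, conversationId);

console.log(rideTrace.pathTo(toolSpanId).join(" > "));
```

With only the final response in hand, turn 3 looks like the problem. With the spans, the truncated tool result one turn earlier is visible as the likely cause.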

Traingent implemented Langfuse for end-to-end tracing and added sentiment analysis as an asynchronous evaluation step. The sentiment work mattered for them specifically because cycling advice is hard to score against a rubric. A response can sound right—mentions cycling, talks about heart rate, uses plausible physiological concepts—and still be wrong because it applied road-cycling logic to someone on a mountain bike. The user will know. They'll respond with some version of “What are you talking about?” The sentiment eval catches that when a spec-based eval wouldn't have fired.
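The sketch below shows one way an evaluation step can sit off the request path: user replies are queued cheaply while the conversation runs and scored later. This is not Traingent's implementation and it does not call Langfuse. The keyword-based `scoreReply` is a deliberately naive stand-in for a real sentiment model or LLM judge, and every name here is an assumption made for illustration.

```typescript
// One user reply awaiting evaluation. traceId links it back to the
// assistant response it was reacting to.
interface PendingReply {
  traceId: string;
  userReply: string;
}

interface SentimentScore {
  traceId: string;
  score: number; // 0 = neutral, negative = confused or unhappy
  flagged: boolean;
}

// Placeholder signal: phrases that suggest the user did not recognize the advice.
// A real system would likely use a sentiment model here; this list is illustrative.
const confusionPhrases = ["what are you talking about", "that's not right", "makes no sense"];

function scoreReply(reply: string): number {
  const text = reply.toLowerCase();
  let hits = 0;
  for (const phrase of confusionPhrases) {
    if (text.indexOf(phrase) !== -1) hits += 1;
  }
  return hits === 0 ? 0 : -hits;
}

class SentimentEvalQueue {
  private pending: PendingReply[] = [];

  // Called on the request path: only enqueues, so the user never waits on scoring.
  enqueue(reply: PendingReply): void {
    this.pending.push(reply);
  }

  // Called later, off the request path (for example from a background worker).
  // Scores everything queued so far and empties the queue.
  drain(): SentimentScore[] {
    const batch = this.pending;
    this.pending = [];
    return batch.map((item) => {
      const score = scoreReply(item.userReply);
      return { traceId: item.traceId, score, flagged: score < 0 };
    });
  }
}

// Usage: the second reply is the "What are you talking about?" case.
const evalQueue = new SentimentEvalQueue();
evalQueue.enqueue({ traceId: "trace-1", userReply: "Thanks, that plan works for me." });
evalQueue.enqueue({ traceId: "trace-2", userReply: "What are you talking about? I ride a mountain bike." });
const flaggedReplies = evalQueue.drain().filter((result) => result.flagged);
console.log(flaggedReplies);
```

Because each score carries a `traceId`, a flagged reply can be joined back to the trace that produced the response, which is where the debugging starts.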

Wayne's take on this is helpful:

"You honestly can't have an LLM product that you're creating for a customer without observability. It's non-deterministic by nature. If you don't have visibility to the traces and the spans, you're completely guessing."

Treat observability as day-zero infrastructure. Not day one. Not "when we have users." Day zero. The traces you capture in development are the traces that let you debug production. The evaluations you build early are the ones that catch regressions when the model changes underneath you.
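Here is a minimal sketch of how an early eval set can catch a regression when the model changes. It assumes a model can be represented as a plain function from prompt to response, which keeps the example runnable. Real calls would go to an LLM provider. The eval cases, the stub models, and the helper names are hypothetical.

```typescript
// A model is anything that maps a prompt to a response. Stubbed as a plain function here.
type Model = (prompt: string) => string;

interface EvalCase {
  label: string;
  prompt: string;
  passes: (response: string) => boolean; // the check this case enforces
}

interface EvalReport {
  passed: number;
  total: number;
  failedLabels: string[];
}

// Runs every case against the model and collects the labels of the cases that fail.
function runEvals(model: Model, cases: EvalCase[]): EvalReport {
  const failedLabels: string[] = [];
  for (const evalCase of cases) {
    if (!evalCase.passes(model(evalCase.prompt))) failedLabels.push(evalCase.label);
  }
  return { passed: cases.length - failedLabels.length, total: cases.length, failedLabels };
}

// Labels that passed on the baseline but fail on the candidate.
function findRegressions(baseline: EvalReport, candidate: EvalReport): string[] {
  return candidate.failedLabels.filter((label) => baseline.failedLabels.indexOf(label) === -1);
}

// Illustrative cases: one checks bike-appropriate advice, one checks date formatting.
const evalCases: EvalCase[] = [
  {
    label: "mtb-advice-mentions-trail",
    prompt: "I ride a mountain bike. Plan my weekend ride.",
    passes: (response) => response.toLowerCase().indexOf("trail") !== -1,
  },
  {
    label: "rest-day-uses-iso-date",
    prompt: "When is my next rest day?",
    passes: (response) => /\d{4}-\d{2}-\d{2}/.test(response),
  },
];

// Stub models standing in for "before" and "after" a model change.
const baselineModel: Model = (prompt) =>
  prompt.indexOf("mountain") !== -1 ? "Try a 2 hour trail ride." : "Rest on 2026-06-14.";
const candidateModel: Model = (prompt) =>
  prompt.indexOf("mountain") !== -1 ? "Ride 80 km on the road." : "Rest on 2026-06-14.";

const regressedLabels = findRegressions(runEvals(baselineModel, evalCases), runEvals(candidateModel, evalCases));
console.log(regressedLabels);
```

The comparison against a baseline is the design choice that matters: a raw pass rate says how the candidate does, while the diff says what the model change broke.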

The dual benefit

The easy case for investing in harness work is that it makes agents more effective. That case is real and it's sufficient on its own. But there's a second effect that is harder to anticipate and matters just as much: the same investments make the next human dramatically more effective too.

When Traingent brought on a new engineer named Patrick, Wayne sat with Claude, walked through the architecture docs, the ADRs, and the roadmap, and generated a concrete two-to-three-week plan matching Patrick's skills against Traingent's actual gaps. That planning exercise would have taken days in a less-instrumented environment. Or it would have happened loosely, in someone's head, and produced a less honest estimate. The system that makes the agent productive is the system that makes onboarding fast.

Hold onto that. The investments look like they're for the agents. They're really for the team.

The speed trap

There's a version of this conversation that sounds like it's about moving fast, and that's not quite right.

The teams I see struggling right now aren't struggling because they can't ship. They're struggling because they can ship too easily, and the output compounds in the wrong direction. A prototype gets built in a weekend. Another gets grafted on. Features accumulate faster than tests. Observability becomes an afterthought. Within months the codebase is a liability and the team is afraid to touch it.

Rikky Singh, Traingent's CEO, describes it this way:

"We ship so fast that tickets become history books and not planning tools."

That only works because the harness is doing the careful work that makes the fast work safe. You can spin up a prototype in an afternoon now. Industrializing one still takes discipline. The difference between teams that scale and teams that stall is almost always whether they invested in that discipline before they thought they needed it.

What to take from this

Two things have held up across every engagement I've worked on in this space.

Invest in the harness, not the tools. What matters is whether your codebase makes the right things easy and the wrong things hard. Lint rules, ADR practices, review skills, scaffolding conventions, and observability will outlast any particular tool generation. The tools will change, so the harness is the real asset.

Design for humans in the lead. The entire point of the harness is to keep people guiding the work, with the system catching drift when it happens. If your setup assumes the AI is driving and a reviewer will catch the problems, you've already lost. Being intentional about your setup and harness will limit drift and help the AI stay on the rails.

Dave Mosher is a Principal Consultant at Test Double, and has experience in legacy modernization, agentic coding, and explaining CORS poorly to people who didn't ask.

Wayne Marley is Co-founder and CTO at Traingent and a runner and cyclist. He’s building intelligent assistants with purpose, personality, and persistence.
