Skip to main content
Test Double company logo
Services
Services Overview
Holistic software investment consulting
Software Delivery
Accelerate quality software development
Product Strategy & Performance
Level up product strategy & performance
Legacy Modernization
Renovate legacy software systems
Pragmatic AI
Solve business problems without hype
Upgrade Rails
Update Rails versions seamlessly
DevOps
Scale infrastructure smoothly
Technical Recruitment
Build tech & product teams
Technical & Product Assessments
Uncover root causes & improvements
Case Studies
Solutions
Accelerate Quality Software
Software Delivery, DevOps, & Product Delivery
Maximize Software Investments
Product Performance, Product Scaling, & Technical Assessments
Future-Proof Innovative Software
Legacy Modernization, Product Transformation, Upgrade Rails, Technical Recruitment
About
About
What's a test double?
Approach
Meeting you where you are
Founder's Story
The origin of our mission
Culture
Culture & Careers
Double Agents decoded
Great Causes
Great code for great causes
EDI
Equity, diversity & inclusion
Insights
All Insights
Hot takes and tips for all things software
Leadership
Bold opinions and insights for tech leaders
Developer
Essential coding tutorials and tools
Product Manager
Practical advice for real-world challenges
Say Hello
Test Double logo
Menu
Services
BackGrid of dots icon
Services Overview
Holistic software investment consulting
Software Delivery
Accelerate quality software development
Product Strategy & Performance
Level up product strategy & performance
Legacy Modernization
Renovate legacy software systems
Pragmatic AI
Solve business problems without hype
Cycle icon
DevOps
Scale infrastructure smoothly
Upgrade Rails
Update Rails versions seamlessly
Technical Recruitment
Build tech & product teams
Technical & Product Assessments
Uncover root causes & improvements
Case Studies
Solutions
Solutions
Accelerate Quality Software
Software Delivery, DevOps, & Product Delivery
Maximize Software Investments
Product Performance, Product Scaling, & Technical Assessments
Future-Proof Innovative Software
Legacy Modernization, Product Transformation, Upgrade Rails, Technical Recruitment
About
About
About
What's a test double?
Approach
Meeting you where you are
Founder's Story
The origin of our mission
Culture
Culture
Culture & Careers
Double Agents decoded
Great Causes
Great code for great causes
EDI
Equity, diversity & inclusion
Insights
Insights
All Insights
Hot takes and tips for all things software
Leadership
Bold opinions and insights for tech leaders
Developer
Essential coding tutorials and tools
Product Manager
Practical advice for real-world challenges
Say hello
Developers
Developers
Developers
Software tooling & tips

Keep your coding agent on task with mutation testing

Code quality tools are helpful guardrails for humans, but coding agents benefit even more. Mutation testing is a rarely-used tool showing new promise as we leverage AI to write more and more software.
Neal Lindsay
|
October 21, 2025
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

AI coding agents are great at generating code. In fact, they'll happily generate too much code. Working effectively with them means getting good at asking for what you want up front, but also getting good at paring down their output.

The more of that pruning process you can foist back onto the computer, the tighter your iteration loop will be.

For instance, I've always gotten value out of code linters and formatters, but LLMs benefit from them so much more. I can't tell you how often Claude has created unused variables that my linter flags before I even open the file.

Tell your coding agent to run formatting, linting, typechecking (if available), and your unit tests after every change. It makes a huge difference in the quality of the code produced.

My tests have to be good?

Yes, this presumes you have unit tests and that they are effective. You can (and should) have the agent write tests, but ensuring they are effective is trickier. There is no full substitute to reading and understanding the code, but we want the computer to solve as many of our problems as possible.

To that end I have had a lot of success lately using mutation testing.

What is mutation testing?

A mutation testing tool intentionally breaks your code and then runs your tests. If your tests pass when your code is broken, they probably need improvement.

I first encountered this concept with Heckle, a gem designed to break your Ruby code. (I just now realized that gem is old enough to vote.) I knew I wanted something like that for TypeScript and found Stryker Mutator.

Stryker is a well-maintained collection of tools with a long history. In addition to JavaScript/TypeScript there are versions available for C# and Scala. The JavaScript version has explicit support for most popular test runners like Jest and Vitest, but it will happily run whatever your npm test script is if you're using something obscure like teenytest.

How to use Stryker

To install it run: npm install --save-dev @stryker-mutator/core and npx stryker init, which helps you create a config file.

The stryker run command will try to mutate all of your non-test code. If you're concentrating on a particular file though you can tell Stryker that with -m filename on the command-line. The output looks something like this:

% npx stryker run -m lib/clean_cache.js
16:06:53 (96157) INFO ProjectReader Found 1 of 21 file(s) to be mutated.
16:06:53 (96157) INFO Instrumenter Instrumented 1 source file(s) with 19 mutant(s)
16:06:53 (96157) INFO ConcurrencyTokenProvider Creating 9 test runner process(es).
16:06:53 (96157) INFO DryRunExecutor Starting initial test run (command test runner with "off" coverage analysis). This may take a while.
16:06:54 (96157) INFO DryRunExecutor Initial test run succeeded. Ran 1 tests in 0 seconds (net 847 ms, overhead 0 ms).
Mutation testing  [==================================================] 100% (elapsed: <1m, remaining: n/a) 19/19 Mutants tested (1 survived, 0 timed out)

All tests
  ✓ All tests (killed 18)

[Survived] ConditionalExpression
lib/clean_cache.js:4:7
-     if (sizeMax === Infinity) return
+     if (false) return

Ran 1.00 tests per mutant on average.
----------------|------------------|----------|-----------|------------|----------|----------|
                | % Mutation score |          |           |            |          |          |
File            |  total | covered | # killed | # timeout | # survived | # no cov | # errors |
----------------|--------|---------|----------|-----------|------------|----------|----------|
All files       |  94.74 |   94.74 |       18 |         0 |          1 |        0 |        0 |
 clean_cache.js |  94.74 |   94.74 |       18 |         0 |          1 |        0 |        0 |
----------------|--------|---------|----------|-----------|------------|----------|----------|
16:06:59 (96157) INFO MutationTestExecutor Done in 5 seconds.

In this example, Stryker found a condition that it hard-coded to `false` without breaking any tests. Here it is again by itself:

[Survived] ConditionalExpression
lib/clean_cache.js:4:7
-     if (sizeMax === Infinity) return
+     if (false) return

This is an example of the limits of mutation testing. This line is an optimization in my caching library to bail out early if the programmer hasn't set a maximum size for the cache. The code works without this optimization, so it is very hard to test around it. I can either abandon the optimization, complicate my code to make the optimization testable, or live with a less-than-100% score. Stryker does not have any facility for "magic comments" to disable checks the way linters do. Fortunately in my testing both Claude and Codex "figured out" that mutations like this are not a problem and didn't insist on getting to 100%.

The scare quotes around the phrase "figured out" are because I don't really know why these agents do what they do. The thing that happens is that the agents sometimes give up before reaching 100%. Sometimes they have reasonable-sounding explanations why giving up was the right thing to do. I am guessing that the agent has a limit on how long it will bang around on a puzzle. If we ask them to solve the halting problem, maybe they would "recognize" it as unsolvable. Maybe they would hit some internal limit and justify giving up by saying the problem is unsolvable. It's interesting to think about, but a little outside the scope of this post. Anyhow, having the agent give up can be good, but I'll mention it again when I talk about the limitations of mutation testing.

Running mutations in your terminal is a great way to check if the coding agent did its homework, but having the agent itself run mutations lets it observe the output directly and run multiple passes. It can take some coaxing, but when it complies it can increase the quality of the tests it writes dramatically, requiring your attention less often.

Asking the robot to run the mutation tool

I have a "mutate" script set up in my package.json that will run mutation only against changed files in the src/ directory.

{ git diff --name-only -z HEAD -- src/ ; git ls-files --others --exclude-standard -z -- src/ ; } | xargs -0 printf '%s,' | xargs stryker run -m

(It's a little wordy in order to get both tracked and untracked files, and to join file names together with commas as the -m option expects.)

Most of the time when I ask Claude or Codex for a new feature it will look something like this:

Implement feature 3 from TODO.md. Make sure you write tests and see them fail before you implement the feature. Once you've written the implementation, check to make sure your tests are good by running `npm run mutate`.

I generally put instructions about things like this in CLAUDE.md or AGENTS.md, but sometimes I have to be explicit in my prompt.

When you think the tests around a particular file could be better you can ask the agent for specifically that:

This is a ten-minute screen recording condensed into about half a minute of looping GIF. I included it as an illustration, but I am worried people will try to actually read it. Please spare yourself the headache and trust my summary below.

This is a very sped-up recording of me asking Claude:

Please run `npx stryker run -m src/store/serial.ts` and use that information to test serial.ts more thorougyly.

(And yes, I misspelled "thoroughly."  At least LLMs are forgiving of typos.)

After my prompt Claude does indeed run the Stryker command I gave it and writes some tests. In the middle somewhere it has to compact the context. After writing tests it ran the Stryker command again. At the end it summarizes its changes and reports that it increased the final score to 96.30%. Hilariously, when I double-checked in a different terminal the final score was 94.44%—still a good number, but also a reminder to not take these tools at their word.

Doesn't code coverage give us this?

Code coverage tools will give you some of the value you get from mutation testing, but not all. For instance, the line that Stryker flagged in my first example would not have been flagged by a code coverage tool.

I think of mutation testing as a more strict version of code coverage. And when you're trying to wrangle the profuse code generation an LLM gives you into code that actually does what you want, strictness is gold.

That doesn't mean you should ignore code coverage. If you have it configured with a test runner that Stryker has explicit support for, Stryker is smart enough to leverage it to decide what tests to run. If it sees that src/a.ts is only covered by test/a.test.ts, it can run just that test file when it mutates src/a.ts. This will make a huge difference to mutation testing run times as your test suite grows. Furthermore, if you have coverage minimums you can enforce those before running mutation the first time, giving your coding agent faster feedback. All this makes code coverage even more useful if you're using Stryker.

Limitations of mutation testing with coding agents

We've already talked about mutation testing complaining about difficult-to-test things like my early return optimization. There are a few other things that might cause you problems.

I mentioned LLMs giving up before reaching 100% as a good thing, but there's no guarantee they won't give up too early. Unlike your test suite or your linter which can give clear pass/fail results, Stryker can only point you to code that might be a problem. So when the machine says "I got it to 80% and that's pretty good." you might have to apply a critical eye and ask it to revisit a line that really looks testable.

However, the main problem most people will run into is unit test suite duration. If the run time of your tests caused you problems before, imagine running them for every possible mutation. It can get out of hand quickly. As your codebase grows, you're going to have to tell your LLM to only run mutations against the files that have changed. And as your test suite grows, Stryker's use of code coverage to run a subset of tests will be important as well.

The future

In this coding-agent-ified world, stocks are way up on code quality tools. Mutation testing was always too far down the list for me to reach for in the past, but I am now excited by the value I'm getting out of it. I expect that these tools will eventually start to change to make themselves even more useful to coding agents, and I wonder what new tools will emerge to help us keep them on task.

‍Neal Lindsay is a Senior Software Consultant at Test Double, and has experience in JavaScript, Ruby, Rails, and sudden changes in how software development gets done.

Related Insights

🔗
From resistance to results: Why agentic coding requires a new mindset
🔗
You’re holding it wrong! The double loop model for agentic coding
🔗
Pragmatic approaches to agentic coding and AI for engineering leaders

Explore our insights

See all insights
Leadership
Leadership
Leadership
AI in the workforce: A shifting talent equation

AI is transformative: reshaping the workforce, blurring roles, and breaking career ladders. Explore how orgs can balance people, technology, and business in this new era of work.

by
Jonathon Baugh
Leadership
Leadership
Leadership
The anvil of alignment: The value of monoliths over microservices

Before dismantling your monolith for microservices, understand its crucial role in organizational alignment. We explore hidden costs, Conway's Law, and strategic insights for leaders making architectural decisions.

by
Dave Mosher
by
Joel Helbling
by
Steve Jackson
by
Patrick Brown
by
Jonathon Baugh
Developers
Developers
Developers
Build common (virtual) ground with unstructured remote calls

Taking remote work to new levels takes new techniques. Explore how unstructured remote calls can help build trust.

by
Tyler Sloane
Letter art spelling out NEAT

Join the conversation

Technology is a means to an end: answers to very human questions. That’s why we created a community for developers and product managers.

Explore the community
Test Double Executive Leadership Team

Learn about our team

Like what we have to say about building great software and great teams?

Get to know us
Test Double company logo
Improving the way the world builds software.
What we do
Services OverviewSoftware DeliveryProduct StrategyLegacy ModernizationPragmatic AIDevOpsUpgrade RailsTechnical RecruitmentAssessments
Who WE ARE
About UsCulture & CareersGreat CausesEDIOur TeamContact UsNews & AwardsN.E.A.T.
Resources
Case StudiesAll InsightsLeadership InsightsDeveloper InsightsProduct InsightsPairing & Office Hours
NEWSLETTER
Sign up hear about our latest innovations.
Your email has been added!
Oops! Something went wrong while submitting the form.
Standard Ruby badge
614.349.4279hello@testdouble.com
Privacy Policy
© 2020 Test Double. All Rights Reserved.