There's a catch with 100% coverage. If the agent writes both the code and the te...

ben_w · 2025-12-30T12:15:21 1767096921

All the problems you list are true, but the solutions not so much.

I've seen this problem with humans even back at university when it was the lecturer's own example attempting to illustrate the value of formal methods and verification.

I would say the solution is neither "get humans to do it" nor "do it before writing code", but rather "get multiple different minds involved to check each other's blind spots, and no matter how many AI models you throw at it they only count as one mind even when they're from different providers". Human tests and AI code, AI tests and human code, having humans do code reviews of AI code or vice-versa, all good. Two different humans usually have different blind spots, though even then I've seen some humans bully their way into being the only voice in the room with the full support of their boss, not that AI would help with that.

godelski · 2025-12-31T03:18:14 1767151094

  > "get multiple different minds involved to check each other's blind spots

This is actually my big gripe about chatbot coding agents. They are trained on human preference and thus they optimize for errors that are in our blind spots.

I don't think people take this subtly seriously enough. Unless we have an /objective/ ground truth we end up proxying our optimization. So we don't optimize for code that /is/ correct, we optimize for code that /looks/ correct. It may seem like a subtle difference but it is critical.

The big difference is when they make errors they are errors that are more likely to be difficult for humans to detect.

Good tools should complement tool users. Fill in gaps. But as we've been trying to train agents to replace humans we are not focusing on this distinction. I want my coding agent to make errors that are obvious to me just as I want errors I make to be obvious to it (or for it to be optimized to detect errors I make)

KurSix · 2026-01-07T08:15:09 1767773709

Exactly, we've basically trained an army of perfect corporate suck-ups. AI optimizes for the "least resistance at review" metric, not "logic correctness," so relying on eyeballs to check AI code right now is risky, we need "unfeeling" validators: compilers, formal verification, and rigid tests

melagonster · 2025-12-30T17:59:43 1767117583

Maybe this is because humans have good intuition to know the difference between us. But this type of intuition does not work on the behaviour of LLMs.

joshribakoff · 2025-12-30T13:54:04 1767102844

That’s why you’ve gotta test your tests. Insert bugs and ensure they fail.

As the sibling comments alluded to, it’s not exclusively an AI problem since multiple people can miss the issue too.

It’s wonderful that AI is an impetus for so many people to finally learn proper engineering principles though!

KurSix · 2026-01-07T08:19:04 1767773944

Mutation testing is becoming the only way to catch AI red-handed. Without mutations you'll be staring at a perfect CI/CD dashboard, unaware that your tests verify absolutely nothing

Yeah, it burns CPU like crazy, but CPU time is dirt cheap right now compared to the cost of an engineer debugging that self-deception in production

vrighter · 2025-12-30T14:05:00 1767103500

but who will test the tests of tests?

Supermancho · 2025-12-30T14:49:08 1767106148

The double entry technique is the most effective path to ensure accuracy (best tradeoffs for time vs accuracy) in finance and software. ie Triple book accounting has not been the standard because it's a bad tradeoff. It requires a large increase in time and effort, for rare increases in accuracy.

adrianN · 2025-12-30T14:54:07 1767106447

The mutation testing engine.

godelski · 2025-12-31T05:19:42 1767158382

Just add tests to test your test tests

smarx007 · 2025-12-30T10:18:35 1767089915

I think the phase change hypothesis* is a bit wrong.

I think it happens not at 100% coverage but at, say, 100% MC/DC test coverage. This is what SQLite and avionics software aim for.

*has not been confirmed by a peer-reviewed research.

cloudhead · 2025-12-30T10:43:35 1767091415

What's MC/DC?

smarx007 · 2025-12-30T11:03:40 1767092620

Modified Condition/Decision Coverage (MC/DC) is a test coverage approach that considers a chunk of code covered if:

- Every branch was "visited". Plain coverage already ensures that. I would actually advocate for 100% branch coverage before 100% line coverage.

- Every part (condition) of a branch clause has taken all possible values. If you have if(enabled && limit > 0), MC/DC requires you to test with enabled, !enabled, limit >0, limit <=0.

- Every change to the condition was shown to somehow change the outcome. (false && limit > 0) would not pass this, a change to the limit would not affect the outcome - the decision is always false. But @zweifuss has a better example.

- And, of course, every possible decision (the outcome of the entire 'enabled && limit > 0') needs to be tested. This is what ensures that every branch is taken for if statements, but also for switch statements that they are exhaustive etc.

MC/DC is usually required for all safety-critical code as per NASA, ESA, automotive (ISO 26262) and industrial (IEC 61508).

amarant · 2025-12-30T14:49:03 1767106143

Limit <=0 appears to include every number between 0 and INT_MIN.

I hope you don't have any string inputs, or your test is gonna take a while to run!

smarx007 · 2025-12-30T15:14:01 1767107641

To test 'limit > 0' according to MC/DC, you need only two values, e.g. -1 and 1. There may be other code inside the branch using limit in some other ways, prompting more test cases and more values of limit but this one only needs two.

But yes, exhaustively testing your code is a bit exhausting ;)

zweifuss · 2025-12-30T11:03:28 1767092608

Modified Condition/Decision Coverage

It's mandated by DO-178C for the highest-level (Level A) avionics software.

Example: if (A && B || C) { ... } else { ... } needs individual tests for A, B, and C.

Test #,A,B,A && B,Outcome taken,Shows independence for

1,True,True,True,if branch,(baseline true)

2,False,True,False,else branch,A (A flips outcome while B fixed at True)

3,True,False,False,else branch,B (B flips outcome while A fixed at True)

zweifuss · 2026-01-01T00:18:24 1767226704

I made a mistake:

  Test #  A      B      C      Result
  1       True   True   False  True
  2       False  True   False  False
  3       True   False  False  False
  4       False  True   True   True

emoII · 2025-12-30T10:58:30 1767092310

Basically branch coverage but also all variations of the predicates, e.g. testing both true || true, and true || false

Antibabelic · 2025-12-30T10:58:23 1767092303

https://en.wikipedia.org/wiki/Modified_condition/decision_co...

closeparen · 2025-12-30T17:40:17 1767116417

Also true of human-written unit tests. You probably also want to have integration or UI automation tests that cover the end-user scenarios in your product requirements, and invariants that are checked against large numbers of examples either taken from production (sanitized of course), in a shadow environment, or generated if you absolutely must.

theptip · 2025-12-30T16:32:31 1767112351

It’s true - but there are “good code” solutions to this already. For example, BDD / Acceptance Tests can be used to write human-readable specs.

IMO it’s quite boilerplate-y to set this up pre-LLM but probably the ROI is favorable now.

Furthermore, as Uncle Bob has written a lot about, putting effort into structuring your tests well is another area that’s usually under-invested. LLMs often write very repetitive tests, but are happy to DRY out, write factories, etc if you ask them.

notimetorelax · 2025-12-30T09:49:36 1767088176

You’re right. What I like doing in those cases is to review very closely the tests and the assertions. Frequently it’s even faster than looking at the SUT itself.

ruszki · 2025-12-30T10:28:40 1767090520

I heard this “review very closely” thing many times, and rarely means review very closely. Maybe 5% of developers really do this ever, and I probably overestimate it. When people send here AI generated code, it’s quite obvious that they don’t review code properly. There are videos when people recorded how we should use LLMs, and they clearly don’t do this.

christophilus · 2025-12-30T12:32:57 1767097977

Yeah. This is me. I try, but I always miss something. The sheer volume and occasional stupidity makes it difficult. Spot checking only gets you so far. Often, the code is excellent except in one or two truly awful spots where it does something crazy.

eternityforest · 2025-12-31T04:18:36 1767154716

Tests freeze behavior in place, and manual end to end testing can confirm that the most common paths are at least kind of correct ish.

Obviously that's not good enough, but I'd much rather have AI tests than poor test coverage.

dbdoskey · 2025-12-30T16:45:18 1767113118

In theory, that is the benefit of having an agent that is limited to only doing the tests, and an agent that only does the coding, and have them run separately, that way to fix a test, you don't change the test, etc...

eru · 2025-12-30T10:12:07 1767089527

Well, we let humans write both business logic code and tests often enough, too.

Btw, you can get a lot further in your tests, if you move away from examples, and towards properties.

jaynetics · 2025-12-30T11:19:25 1767093565

Can you give an example (pun not intended) of testing with properties?

eru · 2025-12-30T13:27:48 1767101268

https://fsharpforfunandprofit.com/series/property-based-test... is an entertaining intro.

https://hypothesis.readthedocs.io/en/latest/ are the docs for one of the best property based testing libraries available in any language.

machomaster · 2025-12-30T11:49:11 1767095351

You could mitigate that risk by using different agents (versions, companies).