Hacker News | viktorianer's comments

The hallucinated GUIDs are a class of failure that prompt instructions will never reliably prevent. The fix that worked: regex patterns running on every file the agent produces, before anything executes.

Well, I'd never had the issue before, and I've hit that or similar issues every few days over the past couple of weeks.

I went the compensation route: a set of regex patterns that check generated code for known anti-patterns before it runs. The agent can ignore a prompt telling it "don't use mocks here." It cannot ignore a failed exit code from a pattern check. Started with 40 patterns and grew to 138 over a few months as new failure modes appeared.
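A minimal sketch of such a gate, assuming a standalone Ruby script run against each generated file (the pattern names and regexes here are illustrative, not the actual 138 checks):

```ruby
# check_patterns.rb -- deterministic quality gate for agent output.
# Pattern list is illustrative; the real set grew from 40 to 138 entries.
BANNED_PATTERNS = {
  "factory_bot_usage" => /FactoryBot\.|create\(:|build\(:/,
  "hallucinated_guid" => /\b\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\b/,
  "skipped_example"   => /\b(xit|xdescribe)\b/
}.freeze

# Returns the names of every pattern the source violates.
def violations(source)
  BANNED_PATTERNS.select { |_name, rx| source.match?(rx) }.keys
end

if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  found = violations(File.read(ARGV.fetch(0)))
  unless found.empty?
    warn "Gate failed: #{found.join(', ')}"
    exit 1 # the agent cannot argue with a nonzero exit code
  end
end
```

The key property is that the gate reasons about nothing: a match is a hard failure, independent of whatever the agent "intended."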

The false comfort usually comes from line coverage. I had a model at 97.2% coverage, 92 tests. Ran equivalence partitioning on it (partition inputs into classes, test one from each) and found 6 real gaps: a search scope testing 1 of 5 fields, a scope checking return type but not filtering logic, a missing state machine branch. SimpleCov said the file was covered. The logical input space was not. The technique is old (ISTQB calls it specification-based testing) but the manual overhead made it impractical until recently. Agents made it possible to apply across 60+ models, which is the one thing they have changed for testing so far.
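The technique in miniature, on a hypothetical search scope over five fields (this stand-in model is mine, not the original code): coverage goes green once any one branch runs, so you test one representative input per partition instead.

```ruby
# Equivalence partitioning sketch. Record and search are hypothetical.
Record = Struct.new(:name, :email, :city, :phone, :notes, keyword_init: true)

# Scope under test: should match a term in ANY of the five fields.
# The gap described above was a version that only checked one of them --
# line coverage would not distinguish the two.
def search(records, term)
  records.select do |r|
    [r.name, r.email, r.city, r.phone, r.notes].compact.any? { |f| f.include?(term) }
  end
end

# One input class per field; each record exercises exactly one partition.
PARTITIONS = {
  name:  Record.new(name: "Ada Lovelace"),
  email: Record.new(email: "ada@example.com"),
  city:  Record.new(city: "London"),
  phone: Record.new(phone: "+44 20 7946 0958"),
  notes: Record.new(notes: "first programmer")
}.freeze
```

A buggy scope that only checked `name` would pass the first partition and fail the other four, which is exactly the class of gap SimpleCov cannot see.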

The fix: the agent that writes code never gets to judge its own output. A separate script scans for known bad patterns, things like reduced test count or forbidden syntax. If the scan fails, the agent loops. It cannot argue with an exit code.

Exactly right. Files on disk as the shared state, not the conversation window. Each step reads current state, does its job, writes output. Next step starts fresh from those files. No accumulated context means no drift, and the LLM can hallucinate in its reasoning all it wants as long as the output passes a check before anything advances.
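A sketch of that shape, assuming JSON files as the state format (the step names and schema are illustrative):

```ruby
# Files-on-disk pipeline step: read current state, do one job, write output.
# No step ever sees the conversation that produced the previous file.
require "json"
require "tmpdir"

def run_step(state_dir, input_name, output_name)
  state  = JSON.parse(File.read(File.join(state_dir, input_name)))
  result = yield(state)  # the step's single job; may involve an LLM call
  File.write(File.join(state_dir, output_name), JSON.generate(result))
  result
end
```

Because each step starts fresh from the files, a hallucination in one step's reasoning only matters if it survives into the output file, where a check can catch it before the next step reads anything.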

Agree completely. The middle ground between "please don't" and full sandboxing: run a validation script between agent steps. The agent writes code, a regex check catches banned patterns, the agent has to fix them before it can proceed. Sandboxing controls what the agent can do. Output validation controls what it gets to keep. Both are more reliable than prompt instructions.

Same experience. I told my agents "never use FactoryBot, always use fixtures." They followed it most of the time. Then under pressure (complex model, lots of associations) they'd fall back to FactoryBot anyway. The agent "knew" the rule. It just didn't always follow it.

What actually worked: a Ruby script that scans the output for known bad patterns. If it finds FactoryBot in the generated code, the agent can't proceed. No reasoning, no judgment call, just a regex match and a hard stop. The agent fixes the violation, then tries again. Went from 40 patterns to 138 over 98 model runs as new failure modes showed up.
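The loop itself is small. A sketch, with the agent call stubbed out as a block (the `enforce` helper and the single regex are mine, for illustration):

```ruby
# Fix-until-clean loop: the gate decides, the agent only gets to retry.
FORBIDDEN = /FactoryBot|create\(:|build\(:/

# The block stands in for re-prompting the agent with the violation.
def enforce(code, max_attempts: 3)
  max_attempts.times do
    return code unless code.match?(FORBIDDEN) # gate passed: keep the output
    code = yield(code)                        # gate failed: agent must fix it
  end
  raise "still violating after #{max_attempts} attempts"
end
```

The cap matters: without it, an agent that keeps regenerating the same violation loops forever instead of surfacing the failure to a human.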


I ran a Claude Code pipeline across 98 Rails models, 9,000 tests total. The thing that made it repeatable: each agent can only do one job. The analyzer writes a YAML plan but no Ruby. The writer gets one slice of that plan, not the whole thing. A Ruby script (not the AI) checks the output for 138 known mistakes before anything moves forward. If the check fails, the agent has to fix it.
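The analyzer/writer split can be sketched like this, with an invented YAML schema standing in for the real plan format:

```ruby
# The analyzer emits a plan; the writer only ever receives one slice of it.
# Schema and model names are illustrative.
require "yaml"

PLAN_YAML = <<~YAML
  models:
    - name: User
      partitions: [valid_email, blank_email, duplicate_email]
    - name: Order
      partitions: [paid, unpaid, refunded]
YAML

# Hands the writer exactly one model's worth of plan, nothing more.
def slice_for(plan_yaml, model_name)
  YAML.safe_load(plan_yaml)["models"].find { |m| m["name"] == model_name }
end
```

Restricting each agent's input this way is what keeps the runs comparable: the writer cannot wander off into other models' plans, so failures localize to one slice.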

That Ruby script is where the data comes from. I can see exactly which checks caught which problems across all 98 runs. That is not anecdotal, it is a log file.


I ran a multi-agent system that generated tests for 60+ Rails models. One spec: 259 examples, 70% of time in factory creation, 897 factory calls. Mathematically rigorous tests that spent most of their time waiting for the database.
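Surfacing that kind of overhead doesn't need profiling to start with; even a crude static count of factory invocations per spec flags the worst offenders. A sketch (the regex and helper are assumptions, not the actual tooling):

```ruby
# Count factory invocations in a spec's source to flag expensive setup.
# Covers bare and FactoryBot-prefixed create/build/build_stubbed calls.
FACTORY_CALL = /\b(?:FactoryBot\.)?(?:create|build|build_stubbed)\(:/

def factory_call_count(source)
  source.scan(FACTORY_CALL).size
end
```

A file scoring in the hundreds, like the 897-call spec above, is a strong hint the time is going to setup rather than assertions.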

AI does not validate its own output. My system runs 138 regex-based checks before an agent can commit. Zero tokens, zero LLM reasoning. The same tooling turned out useful for humans too. The real gamble is using AI without deterministic quality gates between generation and commit.

