Hacker News

We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took like a week and a half of attempts with Claude Code (Sonnet 4.5), GPT-5-Codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... They just came up with a successful workaround (which is good enough for me, but still...).

Aside: You know what really moved the progress bar on finding and fixing the bug? A moment of inspiration: I made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend in near real-time. Really, I was just getting sick of manual testing and pasting the console output into the chat (LOL). Laziness FTW!
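That log-forwarding trick can be tiny. A minimal sketch, where `forwardConsole` and the `/client-logs` endpoint are made-up names, not anything from an actual library:

```javascript
// Mirror browser console output to the backend so a coding agent can
// tail frontend logs. `send` is any function that ships an entry off.
function forwardConsole(send, methods = ["log", "warn", "error"]) {
  for (const level of methods) {
    const original = console[level].bind(console);
    console[level] = (...args) => {
      original(...args); // keep normal devtools output
      try {
        send({ level, ts: Date.now(), args: args.map(String) });
      } catch (_) {
        // never let log shipping break the app
      }
    };
  }
}

// In the browser, the sender would POST to the (hypothetical) endpoint:
// forwardConsole(entry => fetch("/client-logs", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(entry),
// }));
```

The backend then just appends entries to a file the agent can read.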

I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:



Have you tried the Playwright libraries? Not the MCP: instead, tell Claude Code to use the Node.js or Python Playwright libraries directly. I have had some really good results with this for gnarly frontend challenges.


Curious why not the MCP? I use that


I don't really like MCPs, at least when I'm working with coding agents like Claude Code or Codex CLI. I'd rather let the agents write code that can do anything the underlying library is capable of, rather than restricting them to just the functionality that the MCP exposes.

It's more token-efficient too, since I don't need to load the full MCP description into my context.


When I have a bug I'm iterating on, it's much easier and faster to have it write out a Playwright script. That way it doesn't have to waste time or tokens performing the same actions over and over again.

Think of it as TDD.
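A repro script along those lines, using the Playwright Node.js library directly. The URL, selector, and function names here are placeholders, not anything from the actual bug being discussed:

```javascript
// Hypothetical one-shot repro script an agent might write with the
// Playwright library (npm install playwright) instead of the MCP.
const LEVELS = { log: 0, info: 0, warning: 1, error: 2 };

// Pure helper: keep only console messages at or above `minLevel`.
function keepMessage(type, minLevel = "warning") {
  return (LEVELS[type] ?? 0) >= LEVELS[minLevel];
}

// Re-runs the same steps every iteration instead of driving the
// browser by hand; invoke repro() once Playwright is installed.
async function repro(target = "http://localhost:3000") {
  const { chromium } = require("playwright");
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const problems = [];
  page.on("console", msg => {
    if (keepMessage(msg.type())) problems.push(`${msg.type()}: ${msg.text()}`);
  });
  await page.goto(target);
  await page.click("#save-button"); // placeholder for the buggy interaction
  await browser.close();
  return problems;
}
```

Because it's a plain script, the agent can tweak one line and re-run it, which is exactly the TDD-style loop described above.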


I can't tell how much of this is sarcasm

> we (expert developers) ...

> took like a week and a half of attempts with Claude Code ...

What kind of expert developer wastes that much time prompting a bunch of different LLMs to end up with a workaround, instead of actually debugging and fixing the bug themselves?


To be charitable to the parent poster, I've had multi-week bugs that turned out to be a tiny change, where every test iteration took hours of compile time...


Fair question but I think the tone of this is a bit abrasive towards the poster, and unnecessarily so.


There is a lot of disdain for vibe coding/coders, as I'm sure you already know. I was going to post something similar as soon as I read "a week and a half" of prompts. I pray that gainfully employed expert coders don't spend 10 days prompting rather than coding lol


I really don't think so. "Expert" developer really needs to mean something other than "prompting and poking at Claude Code".


> We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

I think any LLM user worth their salt has been doing this pretty much since we got API access to LLMs; otherwise there is no way to actually see if they can solve the things you care about.

The only difference is that you must keep the actual benchmarks to yourself: don't share them with anyone, let alone post them publicly. The second you do, you should probably stop using them as benchmarks, as newly trained LLMs will, intentionally or unintentionally, slurp up your benchmark and suddenly it's no longer a good indicator.

I think I personally started keeping my own test cases for benchmarking around the GPT-3 launch, when it became clear the web would be effectively "poisoned" from that point on, and anything on the public internet could be slurped up by the people feeding the LLMs training data.

Once you have this up and running, you'll get a much more measured view of how well new LLMs work, and you'll quickly see that a lot of the fanfare doesn't actually hold up when tested against your own private benchmarks. On a happier note, you'll also be surprised when a model suddenly does a lot better in a specific area that wasn't even mentioned at release, and then you can switch to it for that specific task :)
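A private harness like that doesn't need to be fancy. A sketch, where a "model" is just an async prompt-to-answer function, and the case names, prompts, and checks are all made up for illustration:

```javascript
// Run a set of private benchmark cases against a model and score them.
// Each case pairs a prompt with a programmatic check on the answer.
async function runBenchmark(model, cases) {
  let passed = 0;
  const failures = [];
  for (const { name, prompt, check } of cases) {
    const answer = await model(prompt);
    if (check(answer)) passed += 1;
    else failures.push(name);
  }
  return { passed, total: cases.length, failures };
}

// Hypothetical cases; a real `model` would wrap an LLM API call.
const myCases = [
  { name: "arith", prompt: "What is 2+2?", check: a => a.includes("4") },
  { name: "crdt-bug", prompt: "Find the bug in this merge function", check: a => a.includes("tombstone") },
];
```

The point is that the checks are yours and never leave your machine, so a new model release can't have trained on them.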


I actually started a collection of annoying bugs I've seen in the wild. I give the LLM the buggy implementation and ask it to write a test that catches it. So far not even a frontier model (Claude Sonnet) can do it, even though they can find and fix the bug itself.
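For anyone curious what that exercise looks like, here's a toy version (the bug is real JavaScript behavior, but the function names are made up): given a deliberately buggy implementation, the model has to pick an input that actually exposes the bug.

```javascript
// Deliberately buggy: Array.prototype.sort compares as strings by
// default, so [10, 9, 2] sorts to [10, 2, 9].
function sortScores(scores) {
  return [...scores].sort();
}

// A catching test must pick an input where string order and numeric
// order disagree; an input like [1, 2, 3] would pass and catch nothing.
function catchesTheBug(fn) {
  const out = fn([10, 9, 2]);
  return JSON.stringify(out) !== JSON.stringify([2, 9, 10]);
}
```

The hard part for the model is choosing the discriminating input, not writing test boilerplate, which may be why fixing the bug is easier than catching it.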


> even a frontier model (Claude Sonnet) can do it

Probably because Sonnet is no longer a frontier model, it isn't even the best model Anthropic offers, according to themselves.


This may be intentional, but I'd like to point out that you're basically suggesting that others aggregate high-quality training data for AI companies to use, free of charge, to replace software engineers.


It would be pretty easy to overfit the results with a static set of tests.


What was the CRDT bug?



