Hacker News

We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took like a week and a half of attempts with Claude Code (Sonnet 4.5), GPT-5-Codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... They just came up with a successful workaround (which is good enough for me, but still...).

Aside: You know what really moved the progress bar on finding and fixing the bug? A moment of inspiration: I made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend in near real-time. Really, I was just getting sick of manual testing and pasting the console output into the chat (LOL). Laziness FTW!
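That log-forwarding trick can be tiny. A minimal sketch, where `forwardConsole` and the `/client-logs` endpoint are made-up names, not anything from an actual library:

```javascript
// Mirror browser console output to the backend so a coding agent can
// tail frontend logs. `send` is any function that ships an entry off.
function forwardConsole(send, methods = ["log", "warn", "error"]) {
  for (const level of methods) {
    const original = console[level].bind(console);
    console[level] = (...args) => {
      original(...args); // keep normal devtools output
      try {
        send({ level, ts: Date.now(), args: args.map(String) });
      } catch (_) {
        // never let log shipping break the app
      }
    };
  }
}

// In the browser, the sender would POST to the (hypothetical) endpoint:
// forwardConsole(entry => fetch("/client-logs", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(entry),
// }));
```

The backend then just appends entries to a file the agent can read.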

I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:



Have you tried the Playwright libraries? Not the MCP: instead, tell Claude Code to use the Node.js or Python Playwright libraries directly. I have had some really good results with this for gnarly frontend challenges.


Curious why not the MCP? I use that


I don't really like MCPs, at least when I'm working with coding agents like Claude Code or Codex CLI. I'd rather let the agents write code that can do anything the underlying library is capable of, rather than restricting them to just the functionality that the MCP exposes.

It's more token-efficient too, since I don't need to load the full MCP description into my context.


When I have a bug I'm iterating on, it's much easier and faster to have it write out a Playwright script. That way it doesn't have to waste time or tokens performing the same actions over and over again.

Think of it as TDD.
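A repro script along those lines, using the Playwright Node.js library directly. The URL, selector, and function names here are placeholders, not anything from the actual bug being discussed:

```javascript
// Hypothetical one-shot repro script an agent might write with the
// Playwright library (npm install playwright) instead of the MCP.
const LEVELS = { log: 0, info: 0, warning: 1, error: 2 };

// Pure helper: keep only console messages at or above `minLevel`.
function keepMessage(type, minLevel = "warning") {
  return (LEVELS[type] ?? 0) >= LEVELS[minLevel];
}

// Re-runs the same steps every iteration instead of driving the
// browser by hand; invoke repro() once Playwright is installed.
async function repro(target = "http://localhost:3000") {
  const { chromium } = require("playwright");
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const problems = [];
  page.on("console", msg => {
    if (keepMessage(msg.type())) problems.push(`${msg.type()}: ${msg.text()}`);
  });
  await page.goto(target);
  await page.click("#save-button"); // placeholder for the buggy interaction
  await browser.close();
  return problems;
}
```

Because it's a plain script, the agent can tweak one line and re-run it, which is exactly the TDD-style loop described above.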


I can't tell how much of this is sarcasm

> we (expert developers) ...

> took like a week and a half of attempts with Claude Code ...

What kind of expert developer wastes that much time prompting a bunch of different LLMs to end up with a workaround, instead of actually debugging and fixing the bug themselves?


To be charitable to the parent poster, I've had multi-week bugs that turned out to be a tiny change, where every test iteration took hours of compile time...


Fair question but I think the tone of this is a bit abrasive towards the poster, and unnecessarily so.


There is a lot of disdain for vibe coding/coders, as I'm sure you already know. I was going to post something similar as soon as I read "a week and a half" of prompts. I pray that gainfully employed expert coders don't spend 10 days prompting rather than coding lol


I really don't think so. "Expert" developer really needs to mean something other than "prompting and poking at Claude Code".


> We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

I think any LLM user worth their salt has been doing this pretty much since we got API access to LLMs; otherwise there is no way to actually see if they can solve the things you care about.

The only difference is that you must keep the actual benchmarks to yourself: don't share them with anyone, let alone post them publicly. The second you do, you should probably stop using them as benchmarks, as newly trained LLMs will, intentionally or unintentionally, slurp up your benchmark and suddenly it's no longer a good indicator.

I think I personally started keeping my own test cases for benchmarking around the GPT-3 launch, when it became clear the web would be effectively "poisoned" from that point on, and anything on the public internet could be slurped up by the people feeding the LLMs training data.

Once you have this up and running, you'll get a much more measured view of how well new LLMs work, and you'll quickly see that a lot of the fanfare doesn't actually hold up when tested against your own private benchmarks. On a happier note, you'll also be surprised when a model suddenly does a lot better in a specific area that wasn't even mentioned at release, and then you can switch to it for that specific task :)
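A private harness like that doesn't need to be fancy. A sketch, where a "model" is just an async prompt-to-answer function, and the case names, prompts, and checks are all made up for illustration:

```javascript
// Run a set of private benchmark cases against a model and score them.
// Each case pairs a prompt with a programmatic check on the answer.
async function runBenchmark(model, cases) {
  let passed = 0;
  const failures = [];
  for (const { name, prompt, check } of cases) {
    const answer = await model(prompt);
    if (check(answer)) passed += 1;
    else failures.push(name);
  }
  return { passed, total: cases.length, failures };
}

// Hypothetical cases; a real `model` would wrap an LLM API call.
const myCases = [
  { name: "arith", prompt: "What is 2+2?", check: a => a.includes("4") },
  { name: "crdt-bug", prompt: "Find the bug in this merge function", check: a => a.includes("tombstone") },
];
```

The point is that the checks are yours and never leave your machine, so a new model release can't have trained on them.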


I actually started a collection of annoying bugs I've seen in the wild. I give the LLM the buggy implementation and ask it to write a test that catches it. So far not even a frontier model (Claude Sonnet) can do it, even though they can find and fix the bug itself.
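For anyone curious what that exercise looks like, here's a toy version (the bug is real JavaScript behavior, but the function names are made up): given a deliberately buggy implementation, the model has to pick an input that actually exposes the bug.

```javascript
// Deliberately buggy: Array.prototype.sort compares as strings by
// default, so [10, 9, 2] sorts to [10, 2, 9].
function sortScores(scores) {
  return [...scores].sort();
}

// A catching test must pick an input where string order and numeric
// order disagree; an input like [1, 2, 3] would pass and catch nothing.
function catchesTheBug(fn) {
  const out = fn([10, 9, 2]);
  return JSON.stringify(out) !== JSON.stringify([2, 9, 10]);
}
```

The hard part for the model is choosing the discriminating input, not writing test boilerplate, which may be why fixing the bug is easier than catching it.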


> even a frontier model (Claude Sonnet) can do it

Probably because Sonnet is no longer a frontier model, it isn't even the best model Anthropic offers, according to themselves.


This may be intentional, but I'd like to point out that you're basically suggesting that others aggregate high-quality training data for AI companies to use, free of charge, to replace software engineers.


It would be pretty easy to overfit the results with a static set of tests.


What was the CRDT bug?



