Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I can be completely off base, but it feels to me like benchmaxxing is going on with swe-bench.

Look at the results from multi swe bench - https://multi-swe-bench.github.io/#/

swe polybench - https://amazon-science.github.io/SWE-PolyBench/

Kotlin bench - https://firebender.com/leaderboard



I kind of had the feeling LLMs would be better at Python vs other languages, but wow, the difference on Multi SWE is pretty crazy.


Maybe a lot of the difference we see between peoples comments about how useful AI is for their coding, is a function of what language they're using. Python coders may love it, Go coders not much at all.


Not sure what you mean by benchmaxxing but we think there's still a lot of useful signals you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html


I mean that there is the possibility that swe bench is being specifically targeted for training and the results may not reflect real world performance.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: