I can be completely off base, but it feels to me like benchmaxxing is going on w...

Bjorkbat · 2025-05-16T23:36:28 1747438588

I kind of had the feeling LLMs would be better at Python vs other languages, but wow, the difference on Multi SWE is pretty crazy.

kristianp · 2025-05-17T22:08:41 1747519721

Maybe a lot of the difference we see between people's comments about how useful AI is for their coding is a function of what language they're using. Python coders may love it, Go coders not much at all.

ofirpress · 2025-05-17T00:46:16 1747442776

Not sure what you mean by benchmaxxing, but we think there's still a lot of useful signal you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html

Snuggly73 · 2025-05-17T02:35:11 1747449311

I mean that there is the possibility that SWE-bench is being specifically targeted in training, so the results may not reflect real-world performance.