I've tried many different models and without doubt the code coming out of them differs a lot when it comes to "quality". Some of that is subjective for sure, but there are objective sides to "good" code.
I wish this was a metric for the AI benchmarks so I could choose a model based on this, because honestly it's one of the things I care most about.
Problem: how do you measure such things, and what are the metrics?
...maybe there just isn't a way to do it, since that metric isn't in the charts.
Code is derivative - it's modeling real behavior. So its quality depends closely on how well it captures what should actually happen.
That's why measuring the actual outcome matters more than raw "code quality" metrics: do the important user flows work, and how does the system behave in the edge cases? I'd rather use something like Journey SDK to fuzz edge cases and measure how well the system behaves than measure some arbitrary properties of the code.
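To make that concrete, here's a rough sketch of outcome-based scoring using only the Python standard library. It does not use Journey SDK's API (I'm not reproducing that here); `apply_discount`, `behaves_correctly`, and `fuzz_score` are hypothetical names made up for illustration. The idea is to throw random and edge-case inputs at the code and report the fraction of cases where the observable behavior is acceptable.

```python
import random


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical system under test: discounted price in cents."""
    if price_cents < 0:
        raise ValueError("price must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    return price_cents - (price_cents * percent) // 100


def behaves_correctly(price_cents: int, percent: int) -> bool:
    """Outcome check: valid input must give a result in [0, price];
    invalid input must be rejected with a clean ValueError."""
    valid = price_cents >= 0 and 0 <= percent <= 100
    try:
        result = apply_discount(price_cents, percent)
    except ValueError:
        return not valid   # rejecting bad input is the desired behavior
    except Exception:
        return False       # any other crash counts as a failure
    return valid and 0 <= result <= price_cents


def fuzz_score(trials: int = 1000, seed: int = 0) -> float:
    """Fraction of fuzzed cases where the system behaved acceptably."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    edge_prices = [0, 1, -1, 99, 10**9]
    edge_percents = [0, 100, -1, 101, 50]
    passed = 0
    for _ in range(trials):
        # Mix hand-picked edge values with uniformly random ones.
        price = rng.choice(edge_prices) if rng.random() < 0.5 else rng.randint(-1000, 10**6)
        percent = rng.choice(edge_percents) if rng.random() < 0.5 else rng.randint(-50, 150)
        passed += behaves_correctly(price, percent)
    return passed / trials


if __name__ == "__main__":
    print(f"behavior score: {fuzz_score():.3f}")
```

The score says nothing about how the code looks, only whether it does the right thing across the inputs tried, so two very different implementations from two models could be compared on the same number.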