Not really that weird. This isn't intended to be a "general" model. It's a coding model, so they showed the coding evals. The assumption would be that, relative to GPT5.1, non-coding evals would likely regress or stay about the same.
Like when advertising the new airliner, most people don't care about how fast it taxis.