I’m convinced as well. There are plenty of folks using it for production use cases who regularly run evaluations, including myself. No evidence that it has been nerfed.
It’s just not as robust and general as it felt when using it for the first time.
Folks using it in production are using the API, where you can explicitly select a model. In the post, the poster is using ChatGPT, through the web app, where you can't select exact model and they sometimes do updates.
The UI also has a system prompt you don’t control which could be changed without model changes, and may have other differences from using the API directly (and those differences may also change over time.)
It’s just not as robust and general as it felt when using it for the first time.