The problem is that’s not the only format they work on, and because input format is largely unconstrained, when they misunderstand, they catastrophically misunderstand.
It’s just like the image recognition ML issue, where it can correctly predict a cat, but change a specific three pixels and it has 99% confidence it’s an ostrich.
Or JavaScript equality. If you do it right, it’s right, but otherwise anything goes.