I tried feeding the four examples from this announcement into Google as dictation inputs and it just sits there blankly. On the JFK speech test file in the repo, Google understands perfectly. The samples in the announcement are clearly outside the capabilities of anything Google has launched publicly, but I don't know how that translates to overall utility in everyday applications.
In my experience with the APIs, Google is excellent and Microsoft is slightly better. The offline model I've been using, which is nearly as good as both, is Facebook's wav2vec2-large-960h-lv60-self.
Don't believe what's on marketing pages; the results rarely transfer to the real world. I'll have to make time to try it and see. In theory, given the task diversity and sheer number of hours, it should be a lot more robust, but I'll wait on evidence before believing any claims of SoTA.
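
For anyone who wants to gather that kind of evidence on their own audio, here is a minimal sketch of word error rate (WER), the usual way to compare transcripts from different engines against a hand-made reference. The function name `word_error_rate` and the lowercase-and-split normalization are my own assumptions, not anything from the announcement or the repo; real evaluations usually also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; normalization here is only
    lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Run each engine on the same clips, pass its transcript in as `hypothesis` with your own transcription as `reference`, and compare the numbers. Lower is better, and the value can exceed 1.0 when the hypothesis has many extra words.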