Then you need a lot of people to listen to those 12B hours of audio, with multiple listeners agreeing, for each chunk of audio, that what is spoken matches the transcript.
Yes, but then you don't need Mozilla to collect read speech samples. You can just scrape any audio out there, run speech activity detection on it, and there you go.
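For anyone wondering what the speech activity detection step could look like, here is a rough sketch, not anyone's actual pipeline. It assumes mono audio already decoded to floats in [-1.0, 1.0]. The function name `detect_speech` and the 30 ms frame size and -35 dB threshold are made up for illustration. A plain energy threshold like this flags any loud sound, not only speech, so a real system would more likely use a trained detector.

```python
import math


def detect_speech(samples, sample_rate, frame_ms=30, threshold_db=-35.0):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold_db.

    samples: sequence of floats in [-1.0, 1.0] (mono PCM).
    frame_ms and threshold_db are illustrative defaults, not tuned values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # start time of the currently open active span, if any

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        # Root-mean-square level of the frame, converted to dB full scale.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        active = level_db > threshold_db
        t = i * frame_len / sample_rate

        if active and start is None:
            start = t                      # an active span opens here
        elif not active and start is not None:
            segments.append((start, t))    # the active span closes here
            start = None

    if start is not None:
        # Audio ended while still active: close the span at the last full frame.
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments
```

You would then cut the audio at the returned spans and keep only those chunks. Note this only finds where something audible happens; it does nothing about getting a verified transcript for each chunk.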