
A simple correlation of audio chunks from microphone and from the TTS should be enough to tell which parts in the input stream are re-recorded TTS. Much simpler, no?
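For concreteness, here is a minimal sketch of what that correlation could look like, assuming both streams are mono float arrays at the same sample rate and that the mic chunk is at least as long as the TTS chunk. The function name `best_tts_match` and the idea of thresholding the score are illustrative assumptions, not anything from an existing library.

```python
import numpy as np

def best_tts_match(mic, tts):
    """Return (lag, score): the offset in `mic` where `tts` aligns best,
    and the normalized correlation at that offset (between -1 and 1)."""
    mic = np.asarray(mic, dtype=np.float64)
    tts = np.asarray(tts, dtype=np.float64)
    # Sliding dot product of the TTS chunk against every mic window.
    dots = np.correlate(mic, tts, mode="valid")
    # Energy of each mic window, computed from a cumulative sum of squares.
    csum = np.concatenate(([0.0], np.cumsum(mic ** 2)))
    win_energy = csum[len(tts):] - csum[:-len(tts)]
    # Normalize so the score does not depend on playback or mic gain.
    denom = np.sqrt(np.maximum(win_energy, 0.0) * np.sum(tts ** 2)) + 1e-12
    scores = dots / denom
    lag = int(np.argmax(scores))
    return lag, float(scores[lag])
```

A caller would treat a score above some hand-tuned threshold as "this window is re-recorded TTS" and use the lag to decide which samples to discard. This only behaves well when the playback path leaves the waveform mostly intact.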


It's not so simple when the impulse responses of the room, mic, and speakers are all unknown and possibly changing, and there are unknown background sounds as well, possibly at a very high level. There's also unknown latency, which can be quite large, especially in the networked case, and maybe some codecs, and maybe some audio "enhancement" software the OEM installed on the user's machine. Also, ideally the computer would be able to hear the user even while it is speaking.

Echo cancellation is non-trivial for sure.
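To illustrate why, here is a rough sketch of the textbook starting point, a normalized LMS (NLMS) adaptive filter that tries to learn the unknown impulse response from the TTS reference and subtract the estimated echo from the mic signal. It assumes the reference is already time-aligned with the mic, the echo path is linear and shorter than `taps` samples, and nobody else is talking while it adapts. The name `nlms_echo_cancel` and its parameters are made up for this example. Each of the problems listed above (latency, codecs, "enhancement" software, a user talking over the TTS) breaks at least one of those assumptions.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.
    Returns (residual, weights). `ref` must be at least as long as `mic`."""
    mic = np.asarray(mic, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    w = np.zeros(taps)      # current estimate of the echo path
    buf = np.zeros(taps)    # most recent reference samples, newest first
    out = np.empty_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = w @ buf                # predicted echo at this sample
        err = mic[n] - echo_est           # what is left after removing it
        # NLMS update: step size scaled by the reference energy in the buffer.
        w += mu * err * buf / (buf @ buf + eps)
        out[n] = err
    return out, w
```

In a clean simulation this converges quickly. Real cancellers add delay estimation, double-talk detection, and nonlinear residual suppression on top, which is where most of the difficulty lives.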


Could it be done more reasonably with the transcription? With diarisation the logic of "someone is saying exactly / almost exactly what I'm saying, that's probably me" might be pretty reasonable.
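A minimal sketch of that heuristic, assuming diarisation has already split the transcript into per-speaker segments and that the text most recently sent to the TTS is available. The helper names `echo_fraction` and `is_probably_self` and the 0.8 threshold are hypothetical choices for illustration.

```python
import re
from difflib import SequenceMatcher

def _tokens(text):
    # Lowercase and keep only word-like tokens so punctuation does not matter.
    return re.findall(r"[a-z0-9']+", text.lower())

def echo_fraction(heard, spoken):
    """Fraction of the heard segment's words that also appear, in order,
    in what the TTS was asked to say."""
    h, s = _tokens(heard), _tokens(spoken)
    if not h:
        return 0.0
    matcher = SequenceMatcher(None, h, s, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(h)

def is_probably_self(heard, spoken, threshold=0.8):
    # "Almost exactly what I'm saying" becomes a tunable overlap threshold.
    return echo_fraction(heard, spoken) >= threshold
```

Measuring overlap against the heard segment rather than the full TTS text means a partial re-recording of a long utterance still counts as an echo. It would also flag a user who repeats the assistant's words back to it, so the threshold is a trade-off.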



