Hacker News

Yeah, same question here.

Building lower-latency pipelines that bridge STT, LLM, and TTS models is fine and all, but compared to a natively multimodal model like GPT-4o the approach seems strictly inferior. The future is clearly voice-native models that can understand nuances in voice and speech patterns, and it's not exactly a distant future.
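To make the comparison concrete, here's a minimal sketch of the cascaded approach being described (all three stage functions are hypothetical stubs, not real APIs): end-to-end latency is the sum of the stages, and only plain text crosses each boundary, so paralinguistic cues like tone, pauses, and emphasis are lost before the LLM ever sees the input.

```python
def stt(audio: bytes) -> str:
    # placeholder speech-to-text stage: audio in, plain text out
    return "transcribed text"

def llm(prompt: str) -> str:
    # placeholder language-model stage: text in, text out
    return f"reply to: {prompt}"

def tts(text: str) -> bytes:
    # placeholder text-to-speech stage: text in, audio out
    return text.encode()

def cascaded_pipeline(audio: bytes) -> bytes:
    # three sequential hops; a voice-native model would collapse
    # this into a single audio-in / audio-out model call
    return tts(llm(stt(audio)))

print(cascaded_pipeline(b"...").decode())  # → reply to: transcribed text
```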


