Building pipelines for bridging LLMs and TTS and STT models with lower latency is fine and all, but when you compare to a natively multimodal model like GPT-4o it seems strictly inferior. The future is clearly voice-native models that are able to understand nuances in voice and speech patterns, and it's not exactly a distant future.
Building pipelines for bridging LLMs and TTS and STT models with lower latency is fine and all, but when you compare to a natively multimodal model like GPT-4o it seems strictly inferior. The future is clearly voice-native models that are able to understand nuances in voice and speech patterns, and it's not exactly a distant future.