MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez

TL;DR
MoshiRAG introduces an asynchronous retrieval-enhanced full-duplex speech language model that improves factual accuracy without sacrificing real-time conversational interactivity.
Contribution
It presents a modular, plug-and-play retrieval framework for full-duplex speech models, enhancing factuality while maintaining natural, real-time interactions.
Findings
Achieves factuality comparable to top non-duplex models.
Supports plug-and-play retrieval methods without retraining.
Excels in out-of-domain mathematical reasoning tasks.
Abstract
Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural…
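The asynchronous idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword-based trigger, the `retrieve` stub, the opener phrase, and all timings are hypothetical stand-ins chosen for illustration. It only shows the scheduling pattern, where retrieval starts in the background while the response opener is being spoken, so the lookup latency is hidden behind the response onset.

```python
import asyncio

# Hypothetical trigger cues. The abstract only says the model identifies
# knowledge-demanding queries; it does not say how, so this is an assumption.
KNOWLEDGE_CUES = ("who", "when", "where", "how many", "what year")


def needs_retrieval(query: str) -> bool:
    """Crude stand-in for detecting a knowledge-demanding query."""
    q = query.lower()
    return any(cue in q for cue in KNOWLEDGE_CUES)


async def retrieve(query: str, latency_s: float = 0.05) -> str:
    """Stand-in for an external knowledge source (hypothetical)."""
    await asyncio.sleep(latency_s)  # simulated lookup latency
    return f"[retrieved facts for: {query}]"


async def respond(query: str, onset_s: float = 0.1) -> list[str]:
    """Speak an opener immediately; join the retrieval result afterwards."""
    spoken = []
    # Launch retrieval in the background only when the query calls for it.
    task = asyncio.create_task(retrieve(query)) if needs_retrieval(query) else None
    # Response onset: the opener is spoken while retrieval is still running.
    spoken.append("Let me think about that.")
    await asyncio.sleep(onset_s)  # simulated time spent speaking the opener
    if task is not None:
        facts = await task  # usually finished already, so little extra wait
        spoken.append(f"Grounded answer using {facts}")
    else:
        spoken.append("Ungrounded answer from the model alone.")
    return spoken


if __name__ == "__main__":
    for line in asyncio.run(respond("What year did the Eiffel Tower open?")):
        print(line)
```

In this sketch the total wait is roughly the longer of the opener duration and the lookup latency rather than their sum, which is the property the abstract attributes to the gap between response onset and delivery of the core information.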
