MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Chung-Ming Chien; Manu Orsini; Eugene Kharitonov; Neil Zeghidour; Karen Livescu; Alexandre Défossez

arXiv:2604.12928·cs.CL·May 13, 2026

TL;DR

MoshiRAG introduces an asynchronous retrieval-enhanced full-duplex speech language model that improves factual accuracy without sacrificing real-time conversational interactivity.

Contribution

The paper presents a modular, plug-and-play retrieval framework for full-duplex speech models that improves factuality while preserving natural, real-time interaction.

Findings

01. Achieves factuality comparable to top non-duplex models.
02. Supports plug-and-play retrieval methods without retraining.
03. Excels in out-of-domain mathematical reasoning tasks.

Abstract

Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural…
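The asynchronous idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `respond`, `slow_retriever`, and `needs_retrieval`, the opening phrases, and the sleep durations are all hypothetical, and plain strings stand in for audio. The sketch only shows retrieval being launched at response onset and awaited when the core information is due, with the retriever passed in as a swappable callable.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical type: any async function mapping a query to retrieved text.
Retriever = Callable[[str], Awaitable[str]]


async def slow_retriever(query: str) -> str:
    """Stand-in for an external knowledge source (assumed, not from the paper)."""
    await asyncio.sleep(0.05)  # simulated retrieval latency
    return f"[retrieved fact for: {query}]"


async def respond(query: str, retriever: Retriever, needs_retrieval: bool) -> List[str]:
    """Speak an opening while retrieval runs in the background, then deliver the answer."""
    spoken: List[str] = []

    # Start retrieval at response onset, only for knowledge-demanding queries.
    task = asyncio.create_task(retriever(query)) if needs_retrieval else None

    # The opening of the response fills the gap before the core information.
    for chunk in ["Good question.", "Let me check."]:
        spoken.append(chunk)
        await asyncio.sleep(0.03)  # simulated time to speak the chunk

    if task is not None:
        # Retrieval has usually finished by now; otherwise this waits for it.
        spoken.append(await task)
    else:
        spoken.append("[answer from the model's own knowledge]")
    return spoken


if __name__ == "__main__":
    for line in asyncio.run(respond("capital of France", slow_retriever, True)):
        print(line)
```

Because `respond` accepts any callable with the `Retriever` signature, a different retrieval backend can be substituted without touching the response loop, which mirrors the plug-and-play property described above in spirit only.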

Code & Models

Datasets

kyutai/HaluEvalAudio_1000 (dataset, 207 downloads)