Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour

TL;DR
Moshi is a novel speech-text foundation model enabling real-time, full-duplex spoken dialogue by generating speech directly from a language model backbone, overcoming latency and interaction limitations of traditional systems.
Contribution
Moshi introduces a unified speech-to-speech generation framework that models conversational dynamics without explicit speaker turns, improving linguistic quality and enabling real-time dialogue.
Findings
Achieves 160 ms theoretical latency and 200 ms practical latency.
Supports streaming speech recognition and text-to-speech.
First real-time full-duplex spoken large language model.
Abstract
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi…
Code & Models
- 🤗 nvidia/personaplex-7b-v1 (model) · 324k dl · ♡ 2332
- 🤗 11mlabs/indri-0.1-350m-tts (model) · 34 dl · ♡ 3
- 🤗 nu-dialogue/j-moshi (model) · 162 dl · ♡ 15
- 🤗 kyutai/stt-2.6b-en (model) · ♡ 120
- 🤗 kyutai/stt-2.6b-en-mlx (model) · ♡ 8
- 🤗 FluidInference/pocket-tts-coreml (model) · 720 dl · ♡ 1
- 🤗 kyutai/hibiki-zero-3b-pytorch-bf16 (model) · 702 dl · ♡ 45
- 🤗 JoshTalksAI/Human-1 (model) · 67 dl · ♡ 2
- 🤗 11mlabs/indri-0.1-124m-tts (model) · 143 dl · ♡ 9
- 🤗 nu-dialogue/j-moshi-ext (model) · 761 dl · ♡ 46
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation
