Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez; Laurent Mazaré; Manu Orsini; Amélie Royer; Patrick Pérez; Hervé Jégou; Edouard Grave; Neil Zeghidour

arXiv:2410.00037 · eess.AS · October 3, 2024 · 5 cites

Open Access 3 Repos 10 Models

TL;DR

Moshi is a speech-text foundation model that enables real-time, full-duplex spoken dialogue by generating speech directly from a language model backbone, avoiding the latency and turn-taking limitations of pipeline-based systems.

Contribution

Moshi introduces a unified speech-to-speech generation framework that models conversational dynamics without explicit speaker turns, improving linguistic quality and enabling real-time dialogue.

Findings

01. Achieves a theoretical latency of 160 ms and a practical latency of 200 ms.

02. Supports streaming speech recognition and text-to-speech.

03. First real-time full-duplex spoken large language model.
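
The 160 ms theoretical figure in the first finding can be reproduced with simple arithmetic. The sketch below assumes, based on the full paper rather than this summary, that the audio codec runs at 12.5 frames per second (80 ms per frame) and that acoustic tokens are delayed by one frame. The constant and function names are hypothetical and chosen only for illustration.

```python
# Illustrative latency arithmetic. The constants are assumptions drawn from
# the full paper, not from this summary page.
FRAME_RATE_HZ = 12.5         # assumed codec frame rate (one frame = 80 ms)
ACOUSTIC_DELAY_FRAMES = 1    # assumed delay applied to acoustic tokens


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """Latency = one frame of buffering plus the acoustic delay, in ms."""
    frame_ms = 1000.0 / frame_rate_hz
    return frame_ms * (1 + delay_frames)


latency = theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES)
print(f"theoretical latency: {latency:.0f} ms")  # prints "theoretical latency: 160 ms"
```

Under these assumptions, the gap between the theoretical 160 ms and the practical 200 ms would correspond to roughly 40 ms of compute and I/O overhead.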

Abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation