Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Xiong Wang; Yangze Li; Chaoyou Fu; Yunhang Shen; Lei Xie; Ke Li; Xing Sun; Long Ma

arXiv:2411.00774·cs.SD·December 10, 2024·2 cites

TL;DR

Freeze-Omni is a novel multimodal speech-text LLM architecture that enables low-latency speech-to-speech dialogue by freezing the backbone LLM and training a specialized speech module with limited data.

Contribution

The paper introduces a three-stage training strategy that connects speech input and output to a frozen LLM, using text-speech paired data and a small amount of multi-round text Q&A data, with the goal of low-latency spoken interaction.

Findings

01. Achieves speech-to-speech conversation using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples.

02. Maintains comparable intelligence levels in the speech and text modalities.

03. Enables duplex dialogue through multi-task training.

Abstract

Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multimodal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both the speech input and output, enabling Freeze-Omni to obtain speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover,…
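The core idea in the abstract, training speech-side modules while the LLM's parameters stay frozen, can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: scalars stand in for the real networks, and every name here (`FROZEN_LLM_WEIGHT`, `llm_forward`, `train_adapter`) is hypothetical. It only shows that gradients can flow through a frozen component to update a trainable one placed in front of it.

```python
# Toy illustration only: scalars stand in for real networks.
# All names are hypothetical and not taken from the Freeze-Omni paper.

FROZEN_LLM_WEIGHT = 2.0  # stands in for the frozen text LLM; never updated


def llm_forward(embedding: float) -> float:
    """Frozen backbone: a fixed linear map."""
    return FROZEN_LLM_WEIGHT * embedding


def train_adapter(pairs, steps=200, lr=0.01):
    """Fit a trainable speech-side adapter weight by gradient descent.

    pairs: list of (speech_feature, target_output) floats.
    Returns (adapter_weight, loss_history).
    """
    adapter = 0.0  # the only trainable parameter
    history = []
    n = len(pairs)
    for _ in range(steps):
        grad = 0.0
        loss = 0.0
        for x, y in pairs:
            err = llm_forward(adapter * x) - y
            loss += err * err
            # The gradient passes through the frozen weight,
            # but the frozen weight itself receives no update.
            grad += 2.0 * err * FROZEN_LLM_WEIGHT * x
        adapter -= lr * grad / n
        history.append(loss / n)
    return adapter, history


# Targets follow y = 3 * x, so the adapter should approach 3 / 2 = 1.5.
pairs = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
adapter, history = train_adapter(pairs)
```

In a real framework the same effect is usually obtained by excluding the backbone's parameters from the optimizer or disabling their gradients; the scalar version above just makes the update rule explicit.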

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis