Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, Long Ma

TL;DR
Freeze-Omni is a novel multimodal speech-text LLM architecture that enables low-latency speech-to-speech dialogue by freezing the backbone LLM and training a specialized speech module with limited data.
Contribution
It introduces a three-stage training strategy for speech-to-speech dialogue using frozen LLMs and minimal multi-round text data, ensuring high-quality, low-latency spoken interactions.
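The frozen-backbone idea can be illustrated with a toy sketch. This is not the paper's implementation: scalars stand in for the networks, and the names FROZEN_W, backbone, adapter, and train_adapter are hypothetical, introduced only for illustration. It shows the one property the contribution relies on, namely that gradients can pass through a fixed model to train a module attached in front of it while the model's own weights are never updated.

```python
# Toy illustration only: scalars stand in for neural networks.
# FROZEN_W plays the role of the frozen LLM; (a, b) play the role of a
# trainable speech-side module. None of these names come from the paper.

FROZEN_W = 2.0  # backbone weight, never updated during training


def backbone(h):
    """Frozen 'LLM': a fixed linear map."""
    return FROZEN_W * h


def adapter(x, a, b):
    """Trainable module mapping an input feature into the backbone's input space."""
    return a * x + b


def train_adapter(data, lr=0.01, epochs=500):
    """Fit the adapter with plain SGD while leaving the backbone untouched."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = backbone(adapter(x, a, b)) - y
            # Gradient of 0.5 * err**2, propagated through the frozen backbone
            # down to the adapter parameters.
            grad_a = err * FROZEN_W * x
            grad_b = err * FROZEN_W
            a -= lr * grad_a
            b -= lr * grad_b
            # Deliberately no update to FROZEN_W.
    return a, b


# Targets generated by y = 6x + 4, so the adapter must learn a = 3, b = 2
# for backbone(adapter(x)) to match.
data = [(x, 6.0 * x + 4.0) for x in (-1.0, 0.0, 1.0, 2.0)]
a, b = train_adapter(data)
print(f"a={a:.4f}, b={b:.4f}, FROZEN_W={FROZEN_W}")
```

In this sketch the adapter parameters converge toward a = 3 and b = 2, while FROZEN_W keeps its original value. In a real framework the same effect is usually obtained by excluding the backbone's parameters from the optimizer.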
Findings
Achieves speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) plus only 60,000 multi-round text Q&A examples.
Maintains comparable intelligence levels in speech and text modalities.
Enables duplex dialogue with multi-task training.
Abstract
Rapidly developing large language models (LLMs) have brought tremendous intelligent applications. Especially, the GPT-4o's excellent duplex speech interaction ability has brought impressive experience to users. Researchers have recently proposed several multi-modal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both the speech input and output, enabling Freeze-Omni to obtain speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs. Moreover,…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis
