Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Zhifei Xie, Changqiao Wu

TL;DR
Mini-Omni is presented as the first fully end-to-end, open-source model for real-time speech interaction: a single model takes speech input and streams speech output, without a separate TTS stage, to keep latency low.
Contribution
The paper introduces Mini-Omni, an audio-based end-to-end conversational model, together with a training method ('Any Model Can Talk') and a dataset (VoiceAssistant-400K), enabling real-time speech interaction without relying on external TTS systems.
Findings
Mini-Omni achieves real-time speech interaction with minimal latency.
The proposed training method, 'Any Model Can Talk', preserves the original model's language capabilities.
The VoiceAssistant-400K dataset supports fine-tuning for speech output.
Abstract
Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction requires models that can reason directly over the audio modality and generate output in a streaming fashion. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces Mini-Omni, an audio-based end-to-end conversational model capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost performance. Our method also helps to retain the original model's language capabilities with…
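The latency argument in the abstract can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's method or measurements: the function names, the per-token time, the TTS startup cost, and the token counts are all hypothetical values chosen for illustration. It compares time-to-first-audio for a cascaded pipeline, which finishes the text reply before a separate TTS system starts, against a streaming model that begins emitting audio after a short head start.

```python
def cascaded_first_audio_ms(n_text_tokens, token_ms, tts_startup_ms):
    """Cascaded pipeline (assumed): the whole text reply is generated,
    then an external TTS system starts and produces the first audio."""
    return n_text_tokens * token_ms + tts_startup_ms


def streaming_first_audio_ms(tokens_before_audio, token_ms):
    """End-to-end streaming (assumed): audio output starts after only a
    few decoding steps, while the rest of the reply is still generated."""
    return tokens_before_audio * token_ms


# All numbers below are hypothetical, picked only to show the shape of the gap.
cascaded = cascaded_first_audio_ms(n_text_tokens=50, token_ms=20, tts_startup_ms=300)
streaming = streaming_first_audio_ms(tokens_before_audio=8, token_ms=20)

print(f"cascaded:  {cascaded} ms to first audio")
print(f"streaming: {streaming} ms to first audio")
```

In this simplified model the cascaded delay grows with reply length, while the streaming delay depends only on the head start, which is the motivation for generating speech directly in streaming.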
Taxonomy
Topics: Speech and dialogue systems
