Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Zhifei Xie; Changqiao Wu

arXiv:2408.16725 · cs.AI · November 6, 2024 · 2 cites

Open Access · 1 Repo · 2 Models

TL;DR

Mini-Omni is presented by its authors as the first fully end-to-end, open-source model for real-time speech interaction. It takes speech as input and generates speech directly, without a separate TTS stage, which keeps response latency low.

Contribution

The paper introduces Mini-Omni, an audio-based end-to-end conversational model, together with a training method ('Any Model Can Talk') and a dataset (VoiceAssistant-400K). The model supports real-time speech interaction without relying on an external TTS system.

Findings

01. Mini-Omni achieves real-time speech interaction with minimal latency.

02. The proposed training method, 'Any Model Can Talk', preserves the original model's language capabilities.

03. The VoiceAssistant-400K dataset supports fine-tuning for speech output.

Abstract

Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with…
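The latency argument in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' method or code: it assumes a cascaded pipeline that waits for the full text reply before an external TTS system starts, and an end-to-end model that begins emitting audio tokens after a short lead of text tokens. All names (`Timings`, `cascaded_first_audio_ms`, `streaming_first_audio_ms`) and all millisecond values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Timings:
    """Hypothetical per-step costs in milliseconds (illustrative, not measured)."""
    text_token_ms: float = 20.0    # assumed time to decode one text token
    tts_startup_ms: float = 300.0  # assumed time for an external TTS to emit its first audio
    audio_token_ms: float = 20.0   # assumed time to decode one audio token end to end


def cascaded_first_audio_ms(n_text_tokens: int, t: Timings) -> float:
    """Time to first audio when the full text reply is generated before TTS starts."""
    return n_text_tokens * t.text_token_ms + t.tts_startup_ms


def streaming_first_audio_ms(lead_text_tokens: int, t: Timings) -> float:
    """Time to first audio when audio tokens follow a short lead of text tokens."""
    return lead_text_tokens * t.text_token_ms + t.audio_token_ms


if __name__ == "__main__":
    t = Timings()
    for n in (10, 50, 200):
        cascade = cascaded_first_audio_ms(n, t)
        stream = streaming_first_audio_ms(1, t)
        print(f"reply of {n} text tokens: cascaded {cascade} ms, streaming {stream} ms")
```

Under these assumptions the cascaded delay grows with the length of the reply, while the streaming delay stays constant. Real cascaded systems can reduce the gap by synthesizing sentence by sentence, so the numbers should be read as a qualitative picture only.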

Code & Models

Repositories

gpt-omni/mini-omni (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems