OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, Shiliang Zhang

TL;DR
OmniFlatten is an end-to-end GPT-based model designed for real-time, full-duplex spoken dialogue, effectively modeling natural conversation behaviors with low latency through a multi-stage training scheme and modality flattening.
Contribution
The paper introduces a novel multi-stage training process and a flattening operation to adapt GPT models for natural, low-latency full-duplex spoken dialogue without architecture modifications.
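As a rough illustration of what a flattening operation could look like, the sketch below interleaves fixed-size chunks of a text-token stream and a speech-token stream into one flat sequence that an unmodified decoder-only model could consume. This is only one plausible reading of the summary above, not the paper's confirmed procedure: the function name `flatten_streams`, the chunk sizes, and the placeholder token strings are all assumptions made for illustration.

```python
from typing import List


def flatten_streams(
    text_tokens: List[str],
    speech_tokens: List[str],
    text_chunk: int = 2,     # hypothetical chunk size, not taken from the paper
    speech_chunk: int = 10,  # hypothetical chunk size, not taken from the paper
) -> List[str]:
    """Interleave fixed-size chunks of two token streams into one sequence."""
    flat: List[str] = []
    t = s = 0
    # Alternate text chunk, speech chunk, until both streams are exhausted.
    while t < len(text_tokens) or s < len(speech_tokens):
        flat.extend(text_tokens[t:t + text_chunk])
        t += text_chunk
        flat.extend(speech_tokens[s:s + speech_chunk])
        s += speech_chunk
    return flat


if __name__ == "__main__":
    text = ["t0", "t1", "t2"]
    speech = ["s0", "s1", "s2", "s3"]
    print(flatten_streams(text, speech, text_chunk=1, speech_chunk=2))
```

Because the result is a single token sequence, a standard next-token objective applies without adding extra streams or heads to the model, which matches the "without architecture modifications" claim in spirit.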
Findings
Effective modeling of complex conversational behaviors
Low-latency real-time speech and text generation
Unified training approach across modalities
Abstract
Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time,…
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Topic Modeling
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Byte Pair Encoding · Linear Warmup With Cosine Annealing
