OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Qinglin Zhang; Luyao Cheng; Chong Deng; Qian Chen; Wen Wang; Siqi Zheng; Jiaqing Liu; Hai Yu; Chaohong Tan; Zhihao Du; Shiliang Zhang

arXiv:2410.17799·cs.CL·January 6, 2025

TL;DR

OmniFlatten is an end-to-end GPT-based model designed for real-time, full-duplex spoken dialogue, effectively modeling natural conversation behaviors with low latency through a multi-stage training scheme and modality flattening.

Contribution

The paper introduces a novel multi-stage training process and a flattening operation to adapt GPT models for natural, low-latency full-duplex spoken dialogue without architecture modifications.
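As a rough illustration of what flattening several token streams into one sequence can look like, the sketch below interleaves fixed-size chunks from parallel streams so that an unmodified GPT-style decoder could consume them as ordinary next-token input. The stream names, chunk sizes, and token labels are assumptions made for this example and are not taken from the paper; the function names are hypothetical as well.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` items."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def flatten_streams(streams, chunk_sizes):
    """Interleave chunks from several token streams into one flat sequence.

    streams:     dict mapping a stream name to its list of tokens
    chunk_sizes: dict mapping a stream name to its chunk size
                 (illustrative values only, not the paper's settings)
    """
    chunked = {name: chunk(toks, chunk_sizes[name]) for name, toks in streams.items()}
    n_steps = max(len(chunks) for chunks in chunked.values())

    flat = []
    for step in range(n_steps):
        # Visit streams in a fixed order at every step so the layout is predictable.
        for name in streams:
            if step < len(chunked[name]):
                flat.extend(chunked[name][step])
    return flat


# Hypothetical streams: user speech, assistant text, assistant speech.
streams = {
    "user_speech": ["u0", "u1", "u2", "u3"],
    "assistant_text": ["t0", "t1"],
    "assistant_speech": ["a0", "a1", "a2", "a3"],
}
chunk_sizes = {"user_speech": 2, "assistant_text": 1, "assistant_speech": 2}

flat = flatten_streams(streams, chunk_sizes)
print(flat)
```

Because the result is a single one-dimensional token sequence, no architectural change to the backbone is needed in this sketch; only the data layout differs from plain text.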

Findings

01 Effective modeling of complex conversational behaviors
02 Low-latency real-time speech and text generation
03 Unified training approach across modalities

Abstract

Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time,…

Code & Models

Repositories

karpathy/nanogpt (PyTorch · Official)

Videos

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation · underline

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Topic Modeling

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Byte Pair Encoding · Linear Warmup With Cosine Annealing