VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Zuwei Long; Yunhang Shen; Chaoyou Fu; Heting Gao; Lijiang Li; Peixian Chen; Mengdan Zhang; Hang Shao; Jian Li; Jinlong Peng; Haoyu Cao; Ke Li; Rongrong Ji; Xing Sun

arXiv:2505.03739 · cs.CL · October 22, 2025

TL;DR

VITA-Audio is an end-to-end large speech model that generates multiple audio tokens per forward pass, reducing first-audio latency in streaming and enabling real-time speech applications with 3-5x faster inference and strong benchmark performance.

Contribution

The paper presents an end-to-end large speech model with a lightweight Multiple Cross-modal Token Prediction (MCTP) module and a four-stage progressive training strategy for real-time audio generation.

Findings

01 Achieves a 3-5x inference speedup at the 7B-parameter scale.

02 Outperforms similar models on automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) benchmarks.

03 Enables real-time conversational speech applications.

Abstract

With the growing demand for natural human-computer interaction, speech-based systems are receiving increasing attention, as speech is one of the most common forms of daily communication. However, existing speech models still suffer from high latency when generating the first audio token during streaming, which is a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency of generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with…
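
The claim that emitting several audio tokens per forward pass accelerates inference can be illustrated with simple arithmetic. The sketch below is not the authors' implementation. It assumes, hypothetically, that each pass of the main model yields one token and that lightweight prediction heads contribute a fixed number of extra tokens at negligible cost. The function names and the `extra_tokens_per_pass` parameter are illustrative only and do not come from the paper.

```python
import math


def main_forward_passes(num_tokens: int, extra_tokens_per_pass: int) -> int:
    """Count main-model forward passes needed to emit `num_tokens` tokens.

    Assumption (not from the paper): every pass yields one token from the
    main model plus `extra_tokens_per_pass` tokens from lightweight heads.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    if extra_tokens_per_pass < 0:
        raise ValueError("extra_tokens_per_pass must be non-negative")
    tokens_per_pass = 1 + extra_tokens_per_pass
    return math.ceil(num_tokens / tokens_per_pass)


def idealized_speedup(num_tokens: int, extra_tokens_per_pass: int) -> float:
    """Ratio of baseline passes to accelerated passes.

    This is an upper bound: it ignores the compute cost of the extra heads.
    """
    baseline = main_forward_passes(num_tokens, 0)
    accelerated = main_forward_passes(num_tokens, extra_tokens_per_pass)
    return baseline / accelerated


if __name__ == "__main__":
    # Compare a few hypothetical head counts for a 100-token response.
    for extra in (0, 2, 4):
        passes = main_forward_passes(100, extra)
        print(f"extra={extra}: {passes} passes, "
              f"speedup {idealized_speedup(100, extra):.2f}x")
```

Under these assumptions, four extra tokens per pass would reduce 100 main-model passes to 20, an idealized 5x bound. Real speedups depend on the cost of the additional heads and on how text and audio tokens are interleaved.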

Code & Models

Repositories

vita-mllm/vita-audio
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need