VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

Yuhao Wang; Ziyang Cheng; Heyang Liu; Ronghua Wu; Qunshan Gu; Yanfeng Wang; Yu Wang

arXiv:2511.10232 · cs.CL · November 14, 2025

TL;DR

VocalNet-M2 is a low-latency spoken language model that combines multi-codebook tokenization with multi-token prediction, cutting first chunk latency from roughly 725 ms to 350 ms while keeping performance competitive with mainstream SLMs.

Contribution

The paper integrates a multi-codebook tokenizer with a multi-token prediction (MTP) strategy so that the model generates multi-codebook speech tokens directly, removing the flow-matching synthesis stage and reducing latency in end-to-end spoken language models.

Findings

01 Reduced first chunk latency from approximately 725 ms to 350 ms

02 Maintained performance competitive with mainstream SLMs

03 Provided insights into multi-codebook strategies for real-time applications
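
To put the latency finding in proportion, the short Python sketch below works out the absolute and relative saving implied by the reported 725 ms and 350 ms figures. It also illustrates, under stated assumptions, why multi-token prediction can shorten generation: if each forward pass emits several speech tokens, fewer passes are needed to fill a chunk. The function names, the 30-token chunk size, and the tokens-per-step values are hypothetical and are not taken from the paper.

```python
import math


def latency_reduction(before_ms: float, after_ms: float) -> tuple[float, float, float]:
    """Return (absolute saving in ms, relative saving as a fraction, speedup factor)."""
    saved = before_ms - after_ms
    return saved, saved / before_ms, before_ms / after_ms


def decoding_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Forward passes needed when each pass emits `tokens_per_step` tokens."""
    return math.ceil(num_tokens / tokens_per_step)


# Reported first chunk latencies (approximate, from the abstract).
saved_ms, fraction, speedup = latency_reduction(725, 350)
print(f"saved {saved_ms:.0f} ms ({fraction:.1%}), {speedup:.2f}x faster")

# Hypothetical chunk of 30 speech tokens; k is the number of tokens per pass.
for k in (1, 2, 4):
    print(f"k={k}: {decoding_steps(30, k)} forward passes")
```

The step count is only a rough proxy: real latency also depends on per-pass cost, the tokenizer, and the decoder, none of which this sketch models.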

Abstract

Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques