Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

Shu-wen Yang; Byeonggeun Kim; Kuan-Po Huang; Qingming Tang; Huy Phan; Bo-Ru Lu; Harsha Sundar; Shalini Ghosh; Hung-yi Lee; Chieh-Chi Kao; Chao Wang

arXiv:2507.09834 · eess.AS · July 15, 2025

TL;DR

This paper introduces a continuous-valued token approach to audio generation that pairs a causal language model with token-wise diffusion, achieving significant improvements over previous discrete-token methods and performance comparable to state-of-the-art diffusion models.

Contribution

The paper presents a continuous-token modeling method based on token-wise diffusion, together with a masked next-token prediction task, improving audio generation quality with fewer parameters.

Findings

01. 20% and 40% relative improvements in FAD and KL divergence, respectively, over the discrete-token AudioGen baseline on AudioCaps.

02. 41% and 33% FAD improvements with fewer parameters.

03. Performance comparable to state-of-the-art diffusion models.
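
For reference, a relative improvement on a lower-is-better metric such as FAD or KL divergence is conventionally computed as the reduction divided by the baseline score. The helper name `relative_gain` and the scores below are hypothetical and only illustrate the arithmetic; they are not values from the paper.

```python
def relative_gain(baseline, proposed):
    """Relative improvement for a lower-is-better metric (e.g. FAD, KL)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Hypothetical scores chosen only to show the calculation; not from the paper.
baseline_score = 2.0
proposed_score = 1.6
print(f"relative gain: {relative_gain(baseline_score, proposed_score):.0%}")
```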

Abstract

Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges because audio is inherently continuous. We study audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and…
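
To make the idea of token-wise diffusion concrete, here is a minimal sketch of one way a per-token training loss could look. It assumes a DDPM-style noise-prediction objective with a cosine schedule, which the abstract does not specify. All names (`cosine_alpha_bar`, `add_noise`, `toy_denoiser`, `tokenwise_diffusion_loss`) are hypothetical. The denoiser is a fixed stand-in for a learned network, and the hidden states are assumed to come from a causal LM that is not shown.

```python
import math
import random


def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level at step t (cosine schedule; an assumed choice)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def add_noise(token, noise, alpha_bar):
    """Forward diffusion: blend the clean token with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(token, noise)]


def toy_denoiser(noisy_token, condition, t_frac):
    """Hypothetical stand-in for a small noise-prediction network.

    A real model would be learned and conditioned on the LM hidden state;
    this fixed function only keeps the sketch runnable.
    """
    return [x - c * (1.0 - t_frac) for x, c in zip(noisy_token, condition)]


def tokenwise_diffusion_loss(tokens, hidden_states, num_steps=1000, rng=random):
    """Mean noise-prediction squared error over next-token positions.

    hidden_states[i] is assumed to summarize tokens[: i + 1] (causal LM output)
    and conditions the denoising of the next continuous token, tokens[i + 1].
    """
    total, count = 0.0, 0
    for i in range(len(tokens) - 1):
        target = tokens[i + 1]
        t = rng.randint(1, num_steps - 1)  # random diffusion step per token
        noise = [rng.gauss(0.0, 1.0) for _ in target]
        noisy = add_noise(target, noise, cosine_alpha_bar(t, num_steps))
        pred = toy_denoiser(noisy, hidden_states[i], t / num_steps)
        total += sum((p - e) ** 2 for p, e in zip(pred, noise))
        count += len(target)
    return total / count


# Toy usage: three 2-dimensional continuous tokens and placeholder hidden states.
tokens = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
hidden_states = [[0.0, 0.0] for _ in tokens]
print(tokenwise_diffusion_loss(tokens, hidden_states, rng=random.Random(0)))
```

The point of the sketch is that no codebook or softmax over discrete tokens appears anywhere: the model is trained to denoise the next continuous vector given the causal context.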
