Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

Jiafeng Liu; Yuanliang Dong; Hongjia Liu; Yuqing Cheng; Zhancheng Guo; Huijing Liang; Wenbo Zhan; Yuming Sun; Xiaobing Li; Feng Yu; Maosong Sun

arXiv:2605.01790 · cs.SD · May 5, 2026

TL;DR

This paper models musical structure and fine acoustic detail within a single deep acoustic-token hierarchy, using a two-stage coarse-to-fine process: a backbone generates coarse tokens for the full track, and a super-resolution model completes the finer tokens in the same token space.

Contribution

The paper proposes a 64-layer residual vector quantization (RVQ) framework with hybrid-attention training and shows that structure and detail can be modeled jointly, without a separate semantic stage.
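To make the idea of a deep RVQ hierarchy concrete, here is a minimal sketch of residual vector quantization in plain Python. Only the depth of 64 layers comes from the paper. The random codebooks, the codebook size of 16, the frame dimension of 4, the shrinking per-layer scale, and the zero entry in each codebook are toy assumptions chosen for illustration; this is not the paper's codec.

```python
import random

NUM_LAYERS = 64      # RVQ depth stated in the paper
CODEBOOK_SIZE = 16   # toy assumption, not from the paper
DIM = 4              # toy assumption, not from the paper


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """One token per layer; each layer quantizes the residual left by the layers before it."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries of the first len(indices) layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_layers, codebook_size, dim, seed=0):
    """Random toy codebooks; deeper layers hold smaller corrections (assumption)."""
    rng = random.Random(seed)
    codebooks = []
    for layer in range(num_layers):
        scale = 0.8 ** layer
        entries = [[0.0] * dim]  # zero entry: adding a layer can never increase the error
        entries += [[rng.gauss(0.0, scale) for _ in range(dim)]
                    for _ in range(codebook_size - 1)]
        codebooks.append(entries)
    return codebooks


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


codebooks = make_codebooks(NUM_LAYERS, CODEBOOK_SIZE, DIM)
frame = [random.Random(1).gauss(0.0, 1.0) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)

# errors[k] is the reconstruction error when only the first k layers are decoded.
errors = [sq_error(frame, rvq_decode(codes[:k], codebooks))
          for k in range(NUM_LAYERS + 1)]
```

In this toy setup the first few entries of `codes` carry the coarse shape of the frame and later entries only add small corrections, which is the property a coarse-to-fine generator relies on.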

Findings

01. Text-vocal alignment can emerge without a separate semantic token stage.

02. Initializing super-resolution from the backbone improves convergence and quality.

03. A fixed 62-step inference process efficiently refines music tokens.

Abstract

A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we explore an alternative view: both may be progressively modeled within a single deep acoustic-token hierarchy. To study this, we build a 64-layer residual vector quantization (RVQ) acoustic representation and propose a two-stage coarse-to-fine generation framework. A backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then completes finer tokens within the same acoustic token space. The super-resolution stage works at full-track scale and refines tokens layer by layer while running in parallel over time, leading to a fixed 62-step inference process. To jointly…
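The coarse-to-fine procedure described in the abstract can be sketched as a loop over RVQ layers. The sketch below assumes the backbone produces the first 2 of the 64 layers, a split inferred only from 64 - 62 = 62 remaining layers and not stated in this excerpt. `backbone_generate` and `predict_layer` are hypothetical stand-ins that return dummy tokens; they are not the paper's models.

```python
NUM_LAYERS = 64   # RVQ depth stated in the abstract
NUM_COARSE = 2    # ASSUMPTION: inferred from 64 - 62, not stated in the excerpt
VOCAB = 1024      # toy assumption for the dummy token values


def backbone_generate(num_frames, num_coarse):
    """Hypothetical backbone stand-in: coarse tokens for the full track, one list per layer."""
    return [[(layer * 31 + t) % VOCAB for t in range(num_frames)]
            for layer in range(num_coarse)]


def predict_layer(tokens_so_far, layer):
    """Hypothetical super-resolution pass: fills every frame of one layer in a single call,
    conditioned on all coarser layers."""
    num_frames = len(tokens_so_far[0])
    return [(sum(row[t] for row in tokens_so_far) + layer) % VOCAB
            for t in range(num_frames)]


def super_resolve(coarse, num_layers):
    """Complete the finer layers one at a time; each step covers the whole track."""
    tokens = [list(row) for row in coarse]
    steps = 0
    for layer in range(len(coarse), num_layers):
        tokens.append(predict_layer(tokens, layer))
        steps += 1
    return tokens, steps


# The step count depends on the number of layers, not on the track length.
short_tokens, short_steps = super_resolve(backbone_generate(8, NUM_COARSE), NUM_LAYERS)
long_tokens, long_steps = super_resolve(backbone_generate(800, NUM_COARSE), NUM_LAYERS)
```

Because each step fills one layer for all frames at once, an 8-frame clip and an 800-frame track need the same number of super-resolution steps under this assumed split.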
