Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

Yuqing Cheng; Xingyu Ma; Guochen Yu; Xiaotao Gu

arXiv:2605.15831 · cs.SD · May 18, 2026

TL;DR

This paper introduces BandTok, a novel 2D Mel-spectrogram tokenizer designed for music generation, which simplifies autoregressive modeling and improves reconstruction quality by representing music as a structured time-frequency grid.

Contribution

The paper presents BandTok, a 2D Mel-spectrogram tokenizer built for autoregressive music generation. It is trained with a multi-scale PatchGAN objective and EMA codebook updates, and is paired with an autoregressive language model that uses 2D Rotary Position Embedding (2D RoPE).

Findings

1. BandTok outperforms residual-codebook tokenizers in reconstruction quality.

2. The 2D RoPE embedding preserves temporal and frequency structure during generation.

3. Experiments demonstrate strong results in data-limited music generation scenarios.
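
To make the 2D RoPE finding concrete, here is a minimal sketch of one common way to build a two-axis rotary embedding: half of each query/key vector is rotated by the time index and the other half by the frequency-band index. This axial split, the function names `rope_1d`, `rope_2d`, and `dot`, and the base of 10000 are assumptions for illustration; the page does not give the paper's exact formulation.

```python
import math
from typing import List, Sequence


def rope_1d(x: Sequence[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of x by the angle pos * base**(-i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def rope_2d(x: Sequence[float], t: int, f: int, base: float = 10000.0) -> List[float]:
    """Assumed axial 2D RoPE: first half encodes time t, second half band f."""
    if len(x) % 4 != 0:
        raise ValueError("vector length must be divisible by 4")
    half = len(x) // 2
    return rope_1d(x[:half], t, base) + rope_1d(x[half:], f, base)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product, used as the attention score."""
    return sum(u * v for u, v in zip(a, b))


# Hypothetical query/key vectors for two tokens on the time-frequency grid.
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [-0.6, 0.8, 0.1, 1.3, -0.9, 0.4, 0.5, -0.2]

# Both pairs are offset by 2 frames and 3 bands, at different absolute positions.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 12, 8), rope_2d(k, 10, 5))
print(math.isclose(score_a, score_b, abs_tol=1e-9))
```

Under this construction the attention score between two tokens depends only on their offset in time and in frequency, which is the sense in which a 2D rotary embedding can keep both axes of the grid distinguishable after the grid is flattened into a sequence.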

Abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and…
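
The token grid described in the abstract can be illustrated with a minimal sketch that splits each Mel frame into equal-width frequency bands and maps every band to its nearest entry in one shared codebook. It quantizes raw Mel values directly, which is a simplification: the actual tokenizer is a trained model whose encoder is not described on this page. The names `nearest_code` and `tokenize_bands`, the equal-width band split, and the toy codebook are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize_bands(
    mel: Sequence[Vector], band_width: int, codebook: Sequence[Vector]
) -> List[List[int]]:
    """Map a [frames][mel_bins] spectrogram to a [frames][bands] token grid."""
    grid: List[List[int]] = []
    for frame in mel:
        if len(frame) % band_width != 0:
            raise ValueError("mel_bins must be a multiple of band_width")
        # Cut the frame into contiguous frequency bands (assumed equal width).
        bands = [frame[i:i + band_width] for i in range(0, len(frame), band_width)]
        # Every band is looked up in the same codebook, so there is no
        # residual hierarchy between the tokens of a frame.
        grid.append([nearest_code(band, codebook) for band in bands])
    return grid


# Toy data: 2 frames, 4 Mel bins, bands of width 2, a 3-entry shared codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
mel = [[0.1, 0.0, 0.9, 1.1], [0.0, 0.9, 1.0, 0.8]]
grid = tokenize_bands(mel, 2, codebook)
print(grid)  # [[0, 1], [2, 1]]
```

Each row of `grid` is one frame and each column is one frequency band, so a token's position has a direct physical meaning, in contrast to residual codebooks where later tokens only refine earlier ones.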

Code & Models

Models

xlbhzz/bandtok-model (Hugging Face model, 15 downloads)