TL;DR
This paper introduces BandTok, a 2D Mel-spectrogram tokenizer for music generation. It represents each frame as Mel-frequency band tokens drawn from a single shared codebook, producing a time-frequency token grid that is easier to model autoregressively than flattened residual codebooks.
Contribution
The paper presents BandTok, a 2D tokenizer for music trained with a multi-scale PatchGAN objective and EMA codebook updates, together with an autoregressive language model that uses 2D Rotary Position Embedding (2D RoPE).
Findings
BandTok outperforms residual-codebook tokenizers in reconstruction quality.
2D RoPE in the language model preserves temporal and frequency structure during generation.
Experiments demonstrate strong results in data-limited music generation scenarios.
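As an illustration of the 2D RoPE finding above, the following is a minimal sketch of one common axial formulation: the first half of a feature vector is rotated by the time index and the second half by the frequency-band index. The function names (rope_1d, rope_2d), the base of 10000, and the half-and-half split are assumptions made for this example; the paper's exact formulation may differ.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles pos * base**(-i/d), i = 0, 2, 4, ..."""
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_2d(vec, t, f):
    """Assumed axial split: first half encodes time index t, second half band index f."""
    assert len(vec) % 4 == 0, "feature dimension must be divisible by 4"
    half = len(vec) // 2
    return rope_1d(vec[:half], t) + rope_1d(vec[half:], f)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]
k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.5, 1.3]

# Both pairs of positions share the same relative offset (dt=2, df=3).
score_a = dot(rope_2d(q, 3, 1), rope_2d(k, 5, 4))
score_b = dot(rope_2d(q, 10, 7), rope_2d(k, 12, 10))
```

Under this formulation the attention score depends only on the relative time and frequency offsets between two tokens, so score_a and score_b agree up to floating-point error. That is one way a model can keep both axes of the grid distinguishable after the tokens are arranged into a sequence.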
Abstract
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and…
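The band-token idea described in the abstract can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes equal-width bands over raw Mel bins and a plain nearest-neighbor lookup, whereas BandTok uses a learned tokenizer. The names nearest_code and band_tokenize, the band width, and the codebook size are hypothetical.

```python
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def band_tokenize(mel, band_width, codebook):
    """Map a Mel spectrogram (list of frames, each a list of floats)
    to a token grid of shape [n_frames][n_bands]."""
    grid = []
    for frame in mel:
        assert len(frame) % band_width == 0, "n_mels must be divisible by band_width"
        bands = [frame[i:i + band_width] for i in range(0, len(frame), band_width)]
        # Every band, regardless of its frequency range, uses the same shared codebook.
        grid.append([nearest_code(band, codebook) for band in bands])
    return grid

# Toy data: 5 frames of 16 Mel bins, 4 bands per frame, 8 shared codes.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
mel = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(5)]
tokens = band_tokenize(mel, band_width=4, codebook=codebook)
```

Here tokens[t][b] is the code index for band b of frame t, so each token has a direct time and frequency meaning. Each token is quantized independently of the others, in contrast to a residual codebook stack where each level refines the residual left by the level before it.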
