DuoTok: Source-Aware Dual-Track Tokenization for Multi-Track Music Language Modeling

Rui Lin; Zhiyue Wu; Jiahe Le; Kangdi Wang; Weixiong Chen; Junyu Dai; Tao Jiang

arXiv:2511.20224·cs.SD·April 2, 2026

TL;DR

DuoTok is a source-aware dual-track tokenizer for multi-track music language modeling. It balances high-fidelity reconstruction against predictability under a language model, reaching the lowest cnBPT on standard benchmarks while keeping reconstruction competitive at 0.75 kbps.

Contribution

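The paper presents a staged disentanglement recipe for dual-track tokenization: pretrain a semantic encoder, regularize it with multi-task supervision, freeze it, then apply hard dual-codebook routing while keeping auxiliary objectives on the quantized codes. A diffusion decoder reconstructs high-frequency detail, so the tokens can focus on structured information for sequence modeling.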

Findings

01 Achieves the lowest cnBPT on standard benchmarks while maintaining competitive reconstruction at 0.75 kbps.

02 Improves enBPT under a dual-track language modeling protocol.

03 Models trained on DuoTok tokens exploit cross-track structure and long-range context.
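
The two quantities in these findings, a bitrate in kbps and a bits-per-token score, follow from simple arithmetic. The sketch below shows the generic conversions only. The function names are hypothetical, the token rate and codebook size are illustrative values chosen so the arithmetic lands on 0.75 kbps (they are not taken from the paper), and the specific normalizations behind cnBPT and enBPT are not given in this listing, so they are not reproduced here.

```python
import math


def bitrate_kbps(tokens_per_second: float, codebook_size: int, num_streams: int = 1) -> float:
    """Nominal bitrate of a discrete token stream, in kilobits per second.

    Each token drawn from a codebook of size K carries log2(K) bits.
    """
    bits_per_code = math.log2(codebook_size)
    return tokens_per_second * bits_per_code * num_streams / 1000.0


def bits_per_token(mean_cross_entropy_nats: float) -> float:
    """Convert a language model's mean cross-entropy from nats to bits per token."""
    return mean_cross_entropy_nats / math.log(2)


# ASSUMPTION: illustrative configuration, not the paper's actual settings.
# Two token streams (one per track), 25 tokens/s each, 32768-entry codebooks.
example_kbps = bitrate_kbps(tokens_per_second=25.0, codebook_size=32768, num_streams=2)

# A model whose mean loss is ln(2) nats spends exactly one bit per token.
example_bpt = bits_per_token(math.log(2))

print(example_kbps, example_bpt)
```

Lower bits per token means the token sequence is easier for a language model to predict, which is the "predictability" side of the trade-off; the bitrate bounds how much information the tokens can carry for reconstruction.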

Abstract

Audio tokenization bridges continuous waveforms and multi-track music language models. In dual-track modeling, tokens should preserve three properties at once: high-fidelity reconstruction, strong predictability under a language model, and cross-track correspondence. We introduce DuoTok, a source-aware dual-track tokenizer that addresses this trade-off through staged disentanglement. DuoTok first pretrains a semantic encoder, then regularizes it with multi-task supervision, freezes the encoder, and applies hard dual-codebook routing while keeping auxiliary objectives on quantized codes. A diffusion decoder reconstructs high-frequency details, allowing tokens to focus on structured information for sequence modeling. On standard benchmarks, DuoTok achieves a favorable predictability-fidelity trade-off, reaching the lowest cnBPT while maintaining competitive reconstruction at 0.75 kbps.…
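
One plausible reading of "hard dual-codebook routing" is that every encoder frame is quantized against the codebook of its own track only, with no soft mixing between the two codebooks. The sketch below illustrates that reading with plain nearest-neighbour vector quantization. It is an assumption-laden toy, not the authors' implementation: the track labels `TRACK_A` and `TRACK_B`, the function names, the 2-D vectors, and the codebook contents are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]

# ASSUMPTION: two tracks with hypothetical labels; the listing does not name them.
TRACK_A, TRACK_B = 0, 1


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def route_and_quantize(
    frames: Sequence[Vector],
    track_ids: Sequence[int],
    codebooks: Dict[int, Sequence[Vector]],
) -> List[Tuple[int, int]]:
    """Hard routing: each frame is looked up only in the codebook of its own track.

    Returns (track_id, code_index) pairs, one per frame.
    """
    tokens = []
    for vec, track in zip(frames, track_ids):
        tokens.append((track, nearest_code(vec, codebooks[track])))
    return tokens


# Toy codebooks with two entries each (illustrative values only).
codebooks = {
    TRACK_A: [(0.0, 0.0), (1.0, 1.0)],
    TRACK_B: [(0.0, 1.0), (1.0, 0.0)],
}
frames = [(0.9, 0.8), (0.9, 0.1), (0.1, 0.2)]
track_ids = [TRACK_A, TRACK_B, TRACK_A]

tokens = route_and_quantize(frames, track_ids, codebooks)
print(tokens)  # [(0, 1), (1, 1), (0, 0)]
```

Because the routing decision is fixed by the track label rather than learned per frame, the two token streams never share code indices, which is one way a tokenizer could keep the tracks separable for a downstream language model.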

Code & Models

Repositories

eps-acoustic-revolution-lab/DUO_TOK (GitHub)