TL;DR
DuoTok is a source-aware dual-track tokenizer for multi-track music language modeling. It balances high-fidelity reconstruction against predictability under a language model, reaching the lowest cnBPT on standard benchmarks while keeping reconstruction competitive at 0.75 kbps.
Contribution
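The paper presents a staged disentanglement approach to dual-track tokenization: pretrain a semantic encoder, regularize it with multi-task supervision, freeze it, and then apply hard dual-codebook routing. The goal is to improve both predictability and fidelity for multi-track music language models.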
Findings
Achieves the lowest cnBPT on standard benchmarks while maintaining competitive reconstruction at 0.75 kbps.
Improves enBPT under the dual-track language modeling protocol.
Models trained on DuoTok tokens utilize cross-track structure and long-range context.
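To put the 0.75 kbps figure in context, the bitrate of a discrete token stream is the frame rate times the number of parallel streams times the bits per token. The sketch below shows only that arithmetic. The function name and the example configuration (frame rate, codebook size, stream count) are hypothetical values chosen to land on 0.75 kbps; they are not taken from the paper.

```python
import math


def bitrate_kbps(frame_rate_hz: float, codebook_size: int, num_streams: int = 1) -> float:
    """Bitrate of a discrete token stream in kilobits per second.

    Each token drawn from a codebook of size K carries log2(K) bits.
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_streams * bits_per_token / 1000.0


# Hypothetical configuration (NOT the paper's): 25 frames per second,
# two parallel token streams, 32768-entry codebooks (15 bits per token).
example = bitrate_kbps(frame_rate_hz=25, codebook_size=32768, num_streams=2)
print(example)
```

Many different combinations of frame rate and codebook size give the same bitrate, so the figure alone does not determine the tokenizer's layout.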
Abstract
Audio tokenization bridges continuous waveforms and multi-track music language models. In dual-track modeling, tokens should preserve three properties at once: high-fidelity reconstruction, strong predictability under a language model, and cross-track correspondence. We introduce DuoTok, a source-aware dual-track tokenizer that addresses this trade-off through staged disentanglement. DuoTok first pretrains a semantic encoder, then regularizes it with multi-task supervision, freezes the encoder, and applies hard dual-codebook routing while keeping auxiliary objectives on quantized codes. A diffusion decoder reconstructs high-frequency details, allowing tokens to focus on structured information for sequence modeling. On standard benchmarks, DuoTok achieves a favorable predictability-fidelity trade-off, reaching the lowest cnBPT while maintaining competitive reconstruction at 0.75 kbps.…
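As an illustration of what hard dual-codebook routing could look like, here is a minimal sketch in which each track's frames are quantized only against that track's own codebook. This is one plausible reading of the abstract, not the authors' implementation: the track names, the nearest-neighbor rule, the toy codebooks, and every function name here are assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `frame` (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def route_and_quantize(
    frames_a: Sequence[Vector],
    frames_b: Sequence[Vector],
    codebook_a: Sequence[Vector],
    codebook_b: Sequence[Vector],
) -> List[Tuple[int, int]]:
    """Hard routing (assumed meaning): track A only uses codebook A, track B only codebook B.

    Returns one (token_a, token_b) pair per time step, so the two token
    streams stay aligned in time.
    """
    if len(frames_a) != len(frames_b):
        raise ValueError("both tracks must have the same number of frames")
    tokens_a = [nearest_code(f, codebook_a) for f in frames_a]
    tokens_b = [nearest_code(f, codebook_b) for f in frames_b]
    return list(zip(tokens_a, tokens_b))


# Toy 2-D codebooks and frames, purely illustrative.
codebook_a = [(0.0, 0.0), (1.0, 1.0)]
codebook_b = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
frames_a = [(0.1, -0.1), (0.9, 1.2)]
frames_b = [(0.4, 0.6), (0.9, 0.1)]

tokens = route_and_quantize(frames_a, frames_b, codebook_a, codebook_b)
print(tokens)  # [(0, 2), (1, 1)]
```

Keeping the two codebooks separate means a token index identifies its track as well as its content, which is one way a tokenizer could preserve cross-track correspondence while keeping each stream predictable on its own.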
