An Independence-promoting Loss for Music Generation with Language Models
Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, and Alexandre Défossez

TL;DR
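This paper introduces an independence-promoting loss that regularizes the auto-encoder used as the tokenizer in language models for music generation. Reducing the statistical dependence between codebooks improves generation quality when the language model fits the product of the codebook marginals, and this decoding strategy generates audio faster than modeling the joint distribution.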
Contribution
It proposes a novel mutual information proxy loss based on maximum mean discrepancy to regularize auto-encoders in multi-codebook music tokenization.
Findings
Reduces statistical dependence between codebooks during auto-encoding.
Improves music generation quality when modeling marginal distributions.
Enables faster audio generation compared to joint distribution modeling.
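The marginal and joint modeling choices named in these findings can be written out explicitly. The notation below is an assumption introduced for illustration and does not come from the paper: $z_t^k$ is the token from codebook $k$ at time step $t$, $K$ is the number of codebooks, and $T$ is the number of time steps.

```latex
% Joint model: all K codebooks at step t are predicted together.
p\left(z_t^1, \dots, z_t^K \mid z_{<t}\right)

% Product-of-marginals model: each codebook is predicted separately.
\prod_{k=1}^{K} p\left(z_t^k \mid z_{<t}\right)

% The two coincide exactly when the codebooks are (conditionally) independent:
p\left(z_t^1, \dots, z_t^K \mid z_{<t}\right)
  = \prod_{k=1}^{K} p\left(z_t^k \mid z_{<t}\right)
```

One common way to model the joint distribution is to predict the $K$ codebooks one after another, which takes on the order of $K \cdot T$ auto-regressive steps instead of $T$. The product form keeps $T$ steps but is only exact under independence, which is why a tokenizer with less dependent codebooks helps.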
Abstract
Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, therefore the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on…
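To make the idea of a maximum-mean-discrepancy proxy for dependence concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the Gaussian kernel, the fixed bandwidth, the biased estimator, and the column-shuffling trick used to draw samples from the product of the marginals are all assumptions chosen for illustration, and every function name is hypothetical. The sketch compares samples from the joint distribution of two variables with samples in which each column has been permuted independently, which breaks any dependence between columns while leaving each marginal unchanged.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    # Biased estimate of the squared MMD between two sample sets.
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

def shuffle_columns(z, rng):
    # Permute each column independently: marginals are preserved,
    # dependence between columns is destroyed (assumed sampling trick).
    cols = [rng.permutation(z[:, k]) for k in range(z.shape[1])]
    return np.stack(cols, axis=1)

def independence_proxy(z, rng, bandwidth=1.0):
    # MMD between joint samples and approximate product-of-marginals samples.
    return mmd2(z, shuffle_columns(z, rng), bandwidth)

rng = np.random.default_rng(0)
n = 512

# Two independent columns: the proxy should be close to zero.
indep = rng.normal(size=(n, 2))

# Two strongly dependent columns: the proxy should be clearly larger.
base = rng.normal(size=(n, 1))
dep = np.concatenate([base, base + 0.1 * rng.normal(size=(n, 1))], axis=1)

score_indep = independence_proxy(indep, rng)
score_dep = independence_proxy(dep, rng)
print(f"independent: {score_indep:.4f}  dependent: {score_dep:.4f}")
```

In a training setup, a differentiable version of such a score would be added to the auto-encoder objective as a penalty. How the paper applies it to the multi-stage quantizer is not described in this summary.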
Taxonomy
Topics: Music and Audio Processing
