Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning

Jisi Zhang; Pablo Peso Parada; Md Asif Jalal; Karthikeyan Saravanan

arXiv:2501.14680·eess.AS·January 28, 2025

Open Access

TL;DR

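This paper introduces a diffusion-based text-to-music generation model that conditions on both a local text representation from T5 and a global text representation from CLAP, improving text adherence and music quality. A pooling-based variant extracts both representations from T5, which reduces model complexity.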

Contribution

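It proposes a conditioning approach that feeds T5 embeddings to the UNet via cross-attention and CLAP embeddings via FiLM, together with mean pooling and self-attention pooling mechanisms that extract a global representation from T5, reducing parameters and enhancing performance in text-to-music generation.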

Findings

01. Incorporating CLAP embeddings improves text adherence.
02. Mean pooling from T5 yields better music quality.
03. The method reduces model parameters while maintaining performance.

Abstract

Diffusion based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically, UNet based diffusion models are conditioned on text embeddings generated from a pre-trained large language model or from a cross-modality audio-language representation model. This work proposes a diffusion based TTM in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from T5 and a global representation from CLAP. Furthermore, we propose modifications that extract both global and local representations from T5 through pooling mechanisms that we call mean pooling and self-attention pooling. This approach mitigates the need for an additional…
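
The pooling and FiLM steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy dimensions, the fixed `query` vector (which stands in for a learned parameter), and the projection weights are all assumptions made for illustration, and the paper's self-attention pooling may be formulated differently from the generic attention pooling shown here.

```python
import math

def mean_pool(tokens):
    """Average per-token embeddings into a single global vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(tokens, query):
    """Attention-weighted average of token embeddings.

    Assumption: `query` stands in for a learned vector; here it is fixed.
    """
    dim = len(query)
    scores = [
        sum(t * q for t, q in zip(token, query)) / math.sqrt(dim)
        for token in tokens
    ]
    weights = softmax(scores)
    return [sum(w * token[j] for w, token in zip(weights, tokens))
            for j in range(dim)]

def linear(vec, weight, bias):
    """Plain affine map: weight @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def film(features, gamma, beta):
    """FiLM: scale and shift every channel of a feature map."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]

# Toy "local" representation: 3 tokens, embedding size 2 (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

global_vec = mean_pool(tokens)                 # [3.0, 4.0]
attn_vec = attention_pool(tokens, [0.0, 0.0])  # zero query gives uniform weights

# Hypothetical projections that map the global vector to per-channel scale/shift.
gamma = linear(global_vec, [[0.5, 0.0], [0.0, 0.25]], [0.0, 0.0])  # [1.5, 1.0]
beta = linear(global_vec, [[0.0, 0.0], [0.0, 0.0]], [1.0, -1.0])   # [1.0, -1.0]

# Toy UNet feature map: 2 channels, 2 time steps.
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma, beta)        # [[2.5, 4.0], [2.0, 3.0]]
```

In this sketch the token-level list plays the role of the local representation that cross-attention would consume, while the pooled vector plays the role of the global representation that drives FiLM.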

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and dialogue systems

Methods: Attention Is All You Need · Byte Pair Encoding · Softmax · Gated Linear Unit · SentencePiece · Residual Connection · Dropout · Linear Layer · Attention Dropout