Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning
Jisi Zhang, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan

TL;DR
This paper introduces a diffusion-based text-to-music generation model that combines a local text representation from T5 with a global one from CLAP, improving text adherence and music quality while reducing model complexity.
Contribution
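It proposes a conditioning approach that feeds the UNet both local T5 embeddings (via cross-attention) and a global CLAP embedding (via FiLM), together with pooling mechanisms (mean pooling and self-attention pooling) that extract the global representation from T5 itself, reducing parameters in text-to-music generation.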
Findings
Incorporating CLAP embeddings improves text adherence.
Mean pooling of T5 embeddings yields better music quality.
The method reduces model parameters while maintaining performance.
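As a rough illustration of the mean pooling mentioned above, the sketch below averages per-token embeddings into a single global vector while skipping padding positions. It is a minimal pure-Python sketch, not the authors' implementation: the function name `mean_pool`, the use of an attention mask, and the toy values are all assumptions made for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: list of T vectors, each a list of D floats
                      (stand-in for a text encoder's per-token output).
    attention_mask:   list of T ints, 1 for real tokens and 0 for padding.
    Both the masking convention and the name are illustrative assumptions.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in total]


# Toy local representation: three token vectors of size 2; the last is padding.
local = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
global_vec = mean_pool(local, mask)
print(global_vec)
```

The padded position is excluded from the average, so the global vector depends only on the real tokens.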
Abstract
Diffusion based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically UNet based diffusion models condition on text embeddings generated from a pre-trained large language model or from a cross-modality audio-language representation model. This work proposes a diffusion based TTM, in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from the T5 and a global representation from the CLAP. Furthermore, we propose modifications that extract both global and local representations from the T5 through pooling mechanisms that we call mean pooling and self-attention pooling. This approach mitigates the need for an additional…
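The two ideas named in the abstract, self-attention pooling and FiLM, can be sketched in a few lines. This is a hedged pure-Python sketch under stated assumptions, not the paper's architecture: the single query vector in `self_attention_pool`, the helper `linear`, and all weights and shapes are hypothetical choices made for illustration, and the abstract does not say where in the UNet the modulation is applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention_pool(token_embeddings, query):
    """Collapse T token vectors into one vector with attention weights.

    `query` stands in for a learned vector (an assumption); weights are a
    softmax over scaled dot products between the query and each token.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in token_embeddings
    ]
    weights = softmax(scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, token_embeddings))
        for i in range(dim)
    ]


def linear(weight, bias, x):
    """Plain affine map: weight is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, global_vec, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: per-channel scale and shift.

    features:   C channels, each a list of T values (toy feature map).
    global_vec: global text representation that predicts gamma and beta.
    """
    gamma = linear(w_gamma, b_gamma, global_vec)
    beta = linear(w_beta, b_beta, global_vec)
    return [
        [g * v + b for v in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy usage: a zero query gives uniform attention weights.
tokens = [[1.0, 0.0], [0.0, 1.0]]
pooled = self_attention_pool(tokens, query=[0.0, 0.0])

features = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels x 2 frames
modulated = film(
    features,
    pooled,
    w_gamma=[[2.0, 0.0], [0.0, 2.0]],
    b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]],
    b_beta=[1.0, -1.0],
)
print(pooled, modulated)
```

FiLM injects a single global vector as a per-channel scale and shift, whereas cross-attention lets each position of the feature map attend to individual tokens. That is why the abstract pairs the global representation with FiLM and the local one with cross-attention.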
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and dialogue systems
Methods: Attention Is All You Need · Byte Pair Encoding · Softmax · Gated Linear Unit · SentencePiece · Residual Connection · Dropout · Linear Layer · Attention Dropout
