Exploring compressibility of transformer based text-to-music (TTM) models
Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

TL;DR
This paper analyzes how to compress large text-to-music models using knowledge distillation and component-specific modifications to the encoder, generative model, and decoder, yielding a smaller model that keeps music generation quality competitive.
Contribution
The paper introduces TinyTTM (89.2M parameters), a significantly compressed TTM model, and explores the trade-offs between model size and generation performance.
Findings
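TinyTTM (89.2M params) achieves a FAD of 3.66 and a KL of 1.32 on MusicBench, better than MusicGen-Small (557.6M params) but not better than MusicGen-Small fine-tuned on MusicBench.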
The compression methods target deployment on resource-constrained devices such as mobile phones.
Trade-offs between size and quality are characterized.
Abstract
State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop or server class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable applicability over the various components of the TTM model (encoder, generative model and the decoder). Leveraging these methods we create TinyTTM (89.2M params) that achieves a FAD of 3.66 and KL of 1.32 on the MusicBench dataset, better than MusicGen-Small (557.6M params) but not lower than MusicGen-Small fine-tuned on MusicBench.
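The abstract names knowledge distillation without giving the loss that was used. As an illustration only, the following is a minimal sketch of the generic temperature-softened distillation loss, assuming the teacher and student both emit logits over the same vocabulary of audio tokens. The function names, the temperature value, and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the generic soft-target loss, not necessarily the paper's.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (soft target)
    log_q = log_softmax(student_logits, temperature)  # student prediction
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

# Toy logits over a hypothetical 4-token vocabulary.
teacher = [2.0, 0.5, -1.0, 0.1]
student_close = [1.8, 0.6, -0.9, 0.0]   # roughly agrees with the teacher
student_far = [-1.0, 2.0, 0.5, 0.3]     # disagrees with the teacher

loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

In this sketch the loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is the signal a smaller student model would be trained to minimize.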
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
Methods: Knowledge Distillation
