MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

Fang-Duo Tsai; Shih-Lun Wu; Weijaw Lee; Sheng-Ping Yang; Bo-Rui Chen; Hao-Chung Cheng; Yi-Hsuan Yang

arXiv:2506.18729 · cs.SD · June 25, 2025

TL;DR

MuseControlLite is a lightweight mechanism for fine-tuning text-to-music generation models so that they follow time-varying musical attributes and reference audio signals. It improves controllability while requiring fewer trainable parameters than state-of-the-art fine-tuning mechanisms.

Contribution

The paper shows that positional embeddings are critical when the condition is a function of time, and that adding them yields more accurate control with fewer trainable parameters.

Findings

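01. Adding rotary positional embeddings to the decoupled cross-attention layers raises melody control accuracy from 56.6% to 61.1%.

02. The method requires 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms.

03. Musical attributes are controlled effectively at a low fine-tuning cost.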

Abstract

We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting,…
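To make the central idea concrete, the following is a minimal NumPy sketch of a decoupled cross-attention layer in which rotary positional embeddings (RoPE) are applied only to the branch that attends to a time-varying condition. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rope`, `decoupled_cross_attention`), the weight dictionary `w`, single-head attention, the half-split rotation layout, and applying RoPE to queries and keys only are all choices made here for brevity.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq, dim) by position-dependent angles; dim must be even."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per feature pair
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoupled_cross_attention(h, text, cond, w, use_rope=True):
    """h: (T, d) latent frames; text: (L, d) text tokens; cond: (T_c, d) time-varying condition.

    Hypothetical layout: one query projection shared by two key/value branches.
    """
    q = h @ w["q"]
    # Text branch: no positional embedding on the text keys.
    out_text = attention(q, text @ w["k_text"], text @ w["v_text"])
    # Condition branch: extra key/value projections for the time-varying condition.
    q_c, k_c, v_c = q, cond @ w["k_cond"], cond @ w["v_cond"]
    if use_rope:
        # Assumption: RoPE on queries and keys, indexed by frame position.
        q_c = rope(q_c, np.arange(h.shape[0]))
        k_c = rope(k_c, np.arange(cond.shape[0]))
    out_cond = attention(q_c, k_c, v_c)
    return out_text + out_cond

# Usage with random data (shapes are arbitrary).
rng = np.random.default_rng(0)
d, T, L = 8, 6, 4
w = {n: rng.standard_normal((d, d)) for n in ["q", "k_text", "v_text", "k_cond", "v_cond"]}
h = rng.standard_normal((T, d))
text = rng.standard_normal((L, d))
cond = rng.standard_normal((T, d))
out = decoupled_cross_attention(h, text, cond, w)  # shape (T, d)
```

In this sketch, calling the layer with `use_rope=False` makes the condition branch invariant to reordering the rows of `cond`, because plain attention treats keys and values as an unordered set. That is one way to see why a positional signal matters when the condition is a function of time.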

Code & Models

Models

fundwotsai2001/Text-to-Music_control_family (Hugging Face model, 8 downloads, 5 likes)

Taxonomy

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Diffusion