Quality-aware Masked Diffusion Transformer for Enhanced Music Generation
Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, Yuan Jiang, Jianqing Gao, Feng Ma

TL;DR
This paper introduces a quality-aware masked diffusion transformer for text-to-music generation. It improves audio quality and musicality through a quality-aware training paradigm, properties of the latent space of musical signals, and caption refinement, and it reports state-of-the-art results.
Contribution
The paper presents a new quality-aware training method and a masked diffusion transformer model tailored for high-quality music generation from imbalanced datasets.
Findings
Achieves state-of-the-art performance on MusicCaps and Song-Describer datasets.
Demonstrates improved musicality and quality control in generated music.
Provides open-source code and pretrained models for reproducibility.
Abstract
Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing
Methods: Diffusion
