Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Chang Li; Ruoyu Wang; Lijuan Liu; Jun Du; Yixuan Sun; Zilu Guo; Zhenrong Zhang; Yuan Jiang; Jianqing Gao; Feng Ma

arXiv:2405.15863 · cs.SD · June 18, 2025

TL;DR

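This paper introduces a quality-aware masked diffusion transformer for text-to-music generation. It improves audio quality and musicality by combining a quality-aware training paradigm, a masked diffusion transformer adapted to the latent space of musical signals, and caption refinement, and it reports state-of-the-art results on the MusicCaps and Song-Describer datasets.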

Contribution

The paper presents a new quality-aware training method and a masked diffusion transformer model tailored for high-quality music generation from imbalanced datasets.

Findings

1. Achieves state-of-the-art performance on the MusicCaps and Song-Describer datasets.

2. Demonstrates improved musicality and quality control in generated music.

3. Provides open-source code and pretrained models for reproducibility.

Abstract

Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets suffer from issues such as low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage…

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing

Methods: Diffusion