Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation

Yi Gu; Yanqing Liu; Chen Yang; Sheng Zhao

arXiv:2603.01565 · eess.AS · March 3, 2026

Open Access

TL;DR

This paper combines detailed LLM-generated audio captions with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), to improve audio fidelity and alignment with complex prompts in text-to-audio generation.

Contribution

The paper applies Group Relative Policy Optimization (GRPO) to fine-tune diffusion transformer (DiT) models for text-to-audio (T2A) generation, and uses detailed LLM-generated audio captions to improve text-audio semantic alignment.

Findings

01 GRPO fine-tuning improves audio fidelity and prompt adherence

02 Reward function design significantly impacts synthesis quality

03 Using LLM-generated captions enhances semantic alignment in T2A

Abstract

Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced…
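
As a rough illustration of the "group relative" idea in GRPO as it is generally described (not the authors' implementation): several samples are generated for the same prompt, each is scored by a reward function, and each sample's advantage is its reward normalized against the group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the reward values below are all assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four audio clips generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Samples scoring above the group mean get positive advantages,
# samples below it get negative ones.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the baseline is the group's own mean, this formulation needs no separately trained value model. How the paper turns such advantages into an update for a DiT-based generator, and which reward functions it uses, is not recoverable from the truncated abstract above.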

Taxonomy

Topics: Music Technology and Sound Studies · Speech and Audio Processing · Music and Audio Processing