Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
Yi Gu, Yanqing Liu, Chen Yang, Sheng Zhao

TL;DR
This paper improves text-to-audio (T2A) generation by combining detailed, LLM-generated audio captions with reinforcement learning via Group Relative Policy Optimization (GRPO), targeting higher audio fidelity and closer alignment with complex prompts.
Contribution
The paper applies Group Relative Policy Optimization (GRPO) to fine-tune diffusion transformer (DiT) models for T2A generation, and uses detailed LLM-generated audio captions to improve text-audio semantic alignment.
Findings
GRPO fine-tuning improves audio fidelity and prompt adherence
Reward function design significantly impacts synthesis quality
Using LLM-generated captions enhances semantic alignment in T2A
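The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. It assumes the commonly described formulation, in which several outputs are sampled for the same prompt, each is scored by a reward function, and each reward is normalized by the group's mean and standard deviation. The function name, the use of population standard deviation, the `eps` constant, and the reward values are illustrative assumptions; this summary does not specify the paper's reward functions or normalization details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation of the group. Implementations may differ.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four audio clips generated from one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update favors the better clips in each group without needing a separately learned value function.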
Abstract
Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced…
Taxonomy
Topics: Music Technology and Sound Studies · Speech and Audio Processing · Music and Audio Processing
