Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models
Xiquan Li, Junxi Liu, Wenxi Chen, Haina Zhu, Ziyang Ma, Xie Chen

TL;DR
This paper introduces Resonate, an online reinforcement learning approach for text-to-audio (TTA) generation that draws its rewards from Large Audio Language Models (LALMs) and significantly improves quality and alignment over previous offline methods.
Contribution
The paper adapts online Group Relative Policy Optimization (GRPO) to Flow Matching-based TTA models and demonstrates superior performance using rewards from Large Audio Language Models (LALMs).
Findings
Resonate outperforms offline RL methods in TTA quality.
Resonate achieves state-of-the-art results on TTA-Bench.
The model uses only 470M parameters.
Abstract
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language-Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching-based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model,…
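The group-relative idea behind GRPO can be illustrated with a minimal sketch. It assumes the commonly used formulation, in which several samples are generated for the same prompt and each sample's reward is normalized by the mean and standard deviation of its group. The function name, the `eps` constant, and the reward values are hypothetical and are not taken from the paper; the authors' adaptation to Flow Matching models is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Samples scoring above the group mean get a positive advantage,
    samples scoring below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four audio clips generated from one prompt,
# e.g. as returned by a reward model such as an LALM.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group, no separate value network is needed to provide a baseline; the other samples for the same prompt serve that role.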
Taxonomy
Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
