Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models

Xiquan Li; Junxi Liu; Wenxi Chen; Haina Zhu; Ziyang Ma; Xie Chen

arXiv:2603.11661·cs.SD·March 13, 2026

TL;DR

This paper introduces Resonate, an online reinforcement learning approach for text-to-audio (TTA) generation that uses rewards from large audio language models (LALMs) and significantly improves quality and alignment over previous offline methods.

Contribution

The paper adapts online Group Relative Policy Optimization (GRPO) to Flow Matching-based TTA models and demonstrates superior performance using rewards from large audio language models.
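
The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the reward values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar score per audio clip sampled from the
    same text prompt. `eps` (an assumed constant) avoids division by
    zero when every sample in the group receives the same score.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four clips generated from one prompt.
scores = [6.0, 8.0, 7.0, 9.0]
advantages = group_relative_advantages(scores)
```

Samples scored above the group mean get a positive advantage and are reinforced; samples below it get a negative one. Because the baseline comes from the group itself, no separate value network is required.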

Findings

01 Resonate outperforms offline RL methods in TTA quality.

02 Resonate achieves state-of-the-art results on TTA-Bench.

03 The model uses only 470M parameters.

Abstract

Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language-Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching-based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model,…
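
For context on the CLAP-based rewards that the abstract attributes to prior work: such a reward is commonly computed as the cosine similarity between a text embedding and an audio embedding. The sketch below assumes the embeddings have already been computed; the vectors are placeholders, not outputs of any real CLAP model, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Placeholder embeddings (hypothetical values, 3 dimensions for brevity).
text_embedding = [0.2, 0.9, 0.1]
audio_embedding = [0.1, 0.8, 0.3]
reward = cosine_similarity(text_embedding, audio_embedding)
```

A reward of this kind reduces each text-audio pair to a single global similarity score, which is the coarseness that the abstract contrasts with the fine-grained scoring signals available from LALMs.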

Code & Models

Models

🤗 AndreasXi/Resonate (model · 71 downloads · 4 likes)

Taxonomy

Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing