T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback

Zehan Wang; Ke Lei; Chen Zhu; Jiawei Huang; Sashuai Zhou; Luping Liu; Xize Cheng; Shengpeng Ji; Zhenhui Ye; Tao Jin; Zhou Zhao

arXiv:2505.10561·cs.SD·May 16, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces AI feedback mechanisms and a large dataset to improve text-to-audio generation, enhancing model performance on complex multi-event and storytelling audio outputs.

Contribution

It presents fine-grained AI scoring pipelines, a new preference dataset, and a benchmark to significantly enhance T2A models' capabilities and alignment with human preferences.

Findings

01. AI scoring pipelines correlate better with human preferences.

02. Large dataset enables effective preference tuning.

03. Models show significant improvements on complex scenarios.

Abstract

Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. First, we introduce fine-grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic&Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1. The motivation presented in the paper is very clear. Effectively evaluating the generation quality of TTA, aligning generated audio with human perceptual systems, and designing reliable evaluation sets are all crucial issues in the TTA and audio generation fields.
2. The methodology is reasonable. Event occurrence and sequence, as well as acoustics and harmonic quality, are indeed three important dimensions in the TTA problem. Utilizing AI models to assess event occurrence and sequence, whil

Weaknesses

1. My primary concern pertains to the scoring pipeline for event occurrence and sequence. In the current design, audio source separation is a critical component. From my experience, audio events in TTA datasets are often quite mixed, with multiple events potentially occurring simultaneously. The existing source separation models seem to struggle with effectively isolating various events. Furthermore, these separated results need to be accurately matched with the multiple event descriptions gener

Reviewer 02 · Rating 6 · Confidence 5

Strengths

This paper is well-formulated and clear in structure, starting from the current deficiencies in TTA models, such as event occurrence, sequence prompt-following, and quality issues. It then progresses to the construction of the scoring pipelines, the introduction of the dataset and benchmark, and finally, demonstrates how preference-tuning with the dataset improves model performance. Each AI feedback scoring metric is described in detail, making it easy to replicate. The clear presentation of qua

Weaknesses

1. The paper does not mention the impact of the validation dataset on other models, such as AudioLDM 2 or Tango 2, to ensure the dataset’s generalizability. Additionally, the benchmark has not been tested on other models, making it difficult to determine the benchmark’s discriminative power and effectiveness.
2. Lines 513 to 515 lack further analysis on why the model performs well in T2A-EpicBench’s long-text scenarios, despite T2A-Feedback focusing more on short-text and single-event descriptio

Reviewer 03 · Rating 5 · Confidence 4

Strengths

* A model-based, scalable approach to generating a large-scale preference dataset is an important direction worth exploring, and T2A-Feedback is one of the early endeavors, which warrants credit.

Weaknesses

* For evaluation and dataset papers like this, the authors could apply more scrutiny when stating the significance of the proposed metric's reliability, for example by running a chi-square test on the confusion matrix and reporting its p-value. The same goes for the benchmarks.
* The verification of the proposed metric's robustness is tied to AudioCaps. Readers may question the reliability of the metrics on other audio datasets of different types, Clotho and MusicCaps to name a few.
* The
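The chi-square test this reviewer suggests can be sketched in a few lines. The example below is a minimal illustration, assuming a 2x2 confusion matrix of metric decisions against human labels; the function name `chi_square_2x2` and the counts are hypothetical and do not come from the paper. For one degree of freedom the p-value has the closed form erfc(sqrt(chi2 / 2)), so only the standard library is needed.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    `table` is [[a, b], [c, d]] of non-negative counts, e.g. rows are the
    metric's decision and columns are the human label (an assumption made
    for this illustration).
    Returns (chi2 statistic, p-value) with 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: metric says "preferred" / "not preferred" vs. human label.
confusion = [[40, 10],
             [10, 40]]
chi2, p = chi_square_2x2(confusion)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}")
```

A small p-value here would only indicate that the metric's decisions are not independent of the human labels; it does not by itself measure how strong the agreement is.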

Reviewer 04 · Rating 3 · Confidence 5

Strengths

The paper explains all three proposed pipelines in detail and presents corresponding experiments to illustrate the advantage of the proposed scoring metrics.

Weaknesses

1. The concept of using an audio separation model to detect event occurrence is intriguing. However, relying on a CLAP-based separation model to address the limitations of the CLAP model itself seems somewhat unconvincing.
2. The rationale behind determining the event occurrence score by selecting the lowest score needs further clarification.
3. For the event sequence score, identifying the correct sequence based solely on volume levels appears challenging. Additional strategies are warranted, e
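The lowest-score aggregation questioned in weakness 2 can be illustrated with a short sketch. It assumes each event in the prompt already has a matching score in [0, 1]; the function name `event_occurrence_score`, the event names, and the score values are hypothetical and are not taken from the paper.

```python
def event_occurrence_score(per_event_scores):
    """Aggregate per-event scores by taking the minimum.

    `per_event_scores` maps each event description to a score in [0, 1]
    (hypothetical values for illustration). Using the minimum means one
    missing event pulls the whole prompt's score down.
    """
    if not per_event_scores:
        raise ValueError("at least one event score is required")
    return min(per_event_scores.values())


# Made-up scores for a three-event prompt.
scores = {"dog barking": 0.82, "door slamming": 0.15, "rain falling": 0.77}
print(event_occurrence_score(scores))
```

With the minimum, two well-matched events cannot compensate for one that is absent, whereas a mean over the same made-up scores would hide the missing event. That is one possible rationale for the choice, not necessarily the authors'.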

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis