BATON: Aligning Text-to-Audio Model with Human Preference Feedback

Huan Liao; Haonan Han; Kai Yang; Tianjiao Du; Rui Yang; Zunnan Xu; Qinmei Xu; Jingquan Liu; Jiasheng Lu; Xiu Li

arXiv:2402.00744·cs.SD·February 2, 2024·1 cites

Open Access

TL;DR

BATON is a framework that improves text-to-audio generation by using human preference feedback to fine-tune models, resulting in more aligned and higher-quality audio outputs.

Contribution

This paper introduces BATON, a novel three-stage framework that leverages human feedback to enhance text-to-audio model alignment and quality.

Findings

01. Significant improvement in audio quality and human preference alignment.

02. Effective use of a curated dataset with human annotations.

03. Enhanced model performance in terms of audio integrity and temporal coherence.

Abstract

With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, these models struggle to generate audio aligned with human preference, owing to the inherent information density of natural language and the models' limited understanding ability. To alleviate this issue, we propose BATON, a framework designed to enhance the alignment between generated audio and the text prompt using human preference feedback. BATON comprises three key stages. First, we curate a dataset of prompts and the corresponding generated audio, annotated with human feedback. Second, we train a reward model on the constructed dataset, which mimics human preference by assigning rewards to input text-audio pairs. Finally, we use the reward model to fine-tune an off-the-shelf text-to-audio model. The…
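The three stages named in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the two-dimensional feature vectors stand in for embeddings of a text-audio pair, the logistic-regression reward model stands in for whatever network the paper uses, and the reward-weighted loss in stage 3 is just one plausible way a reward could steer fine-tuning. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Stage 1 (hypothetical data): each item pairs toy features of a text-audio
# pair with a binary human label (1 = audio matches the prompt, 0 = it does not).
def make_toy_dataset(n=200, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        label = 1 if x[0] + 0.5 * x[1] > 0 else 0
        data.append((x, label))
    return data


# Stage 2 (assumed form): fit a logistic-regression reward model with
# full-batch gradient descent on the binary cross-entropy loss.
def train_reward_model(data, lr=0.5, epochs=200):
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def reward(model, x):
    """Score in (0, 1): the predicted probability that a human would approve."""
    w, b = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


# Stage 3 (assumed form): weight each sample's generative loss by its reward,
# so that preferred samples contribute more to the fine-tuning objective.
def reward_weighted_loss(per_sample_losses, rewards):
    total = sum(r * l for r, l in zip(rewards, per_sample_losses))
    return total / len(per_sample_losses)


if __name__ == "__main__":
    dataset = make_toy_dataset()
    model = train_reward_model(dataset)
    print("reward for a 'good' pair:", reward(model, [1.0, 1.0]))
    print("reward for a 'bad' pair:", reward(model, [-1.0, -1.0]))
```

In a real system the per-sample losses passed to `reward_weighted_loss` would come from the text-to-audio model's own training objective; here they are plain numbers so the weighting step can be inspected in isolation.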

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing