PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

Shuchen Shi; Ruibo Fu; Zhengqi Wen; Jianhua Tao; Tao Wang; Chunyu Qiang; Yi Lu; Xin Qi; Xuefei Liu; Yukun Liu; Yongwei Li; Zhiyong Wang; Xiaopeng Wang

arXiv:2406.04683 · cs.SD · June 10, 2024

TL;DR

This paper introduces PPPR, a portable plug-in prompt refiner that uses large language models to make text-to-audio generation more robust and accurate without altering the acoustic training set, reaching a state-of-the-art Inception Score of 8.72.

Contribution

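The paper presents a plug-in prompt refiner that uses LLM knowledge of textual descriptions to make TTA models more robust, together with a Chain-of-Thought verification process that improves the accuracy of audio descriptions, without altering the acoustic training set.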

Findings

1. Achieves a state-of-the-art Inception Score (IS) of 8.72.
2. Outperforms AudioGen, AudioLDM, and Tango on Inception Score.
3. Enhances the robustness of TTA models to complex text and the accuracy of the generated content.
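
For readers unfamiliar with the metric, the Inception Score is conventionally defined as the exponential of the mean KL divergence between each sample's predicted class distribution p(y|x) and the marginal p(y). The sketch below computes that standard formula in plain Python. It assumes the per-clip class probabilities come from some pretrained audio classifier; which classifier the paper uses is not stated on this page, and `inception_score` is a name chosen here for illustration.

```python
import math

def inception_score(probs):
    """Standard Inception Score from per-sample class probabilities.

    probs: list of lists, one p(y|x) distribution per generated clip
           (assumed to come from a pretrained classifier; each sums to 1).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all clips.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    # Mean KL(p(y|x) || p(y)); zero-probability terms contribute nothing.
    mean_kl = sum(
        sum(pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0)
        for p in probs
    ) / n
    return math.exp(mean_kl)
```

The score is 1 when every clip yields the same distribution and rises toward the number of classes as predictions become both confident per clip and varied across clips.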

Abstract

Text-to-Audio (TTA) aims to generate audio that corresponds to the given text description, playing a crucial role in media production. The text descriptions in TTA datasets lack rich variations and diversity, resulting in a drop in TTA model performance when faced with complex text. To address this issue, we propose a method called Portable Plug-in Prompt Refiner, which utilizes rich knowledge about textual descriptions inherent in large language models to effectively enhance the robustness of TTA acoustic models without altering the acoustic training set. Furthermore, a Chain-of-Thought that mimics human verification is introduced to enhance the accuracy of audio descriptions, thereby improving the accuracy of generated content in practical applications. The experiments show that our method achieves a state-of-the-art Inception Score (IS) of 8.72, surpassing AudioGen, AudioLDM and…
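
The plug-in idea in the abstract can be pictured as a thin wrapper placed in front of an unmodified TTA model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the prompt templates, the YES/NO verification protocol, the retry count, and the names `refine_prompt` and `generate_audio` are all hypothetical, and `llm` and `tta_model` stand for any callables the reader supplies.

```python
from typing import Callable

# Hypothetical templates; the paper's actual prompts are not given on this page.
REWRITE_TEMPLATE = "Rewrite this request as a concise audio description: {text}"
VERIFY_TEMPLATE = (
    "Request: {text}\nDescription: {caption}\n"
    "Think step by step, then answer YES on the last line if the "
    "description matches the request, otherwise NO."
)

def refine_prompt(text: str, llm: Callable[[str], str], max_tries: int = 3) -> str:
    """Return an LLM-rewritten description that passes a verification step."""
    for _ in range(max_tries):
        caption = llm(REWRITE_TEMPLATE.format(text=text)).strip()
        verdict = llm(VERIFY_TEMPLATE.format(text=text, caption=caption))
        lines = verdict.strip().splitlines()
        # Accept only a non-empty caption whose verdict ends in an explicit YES.
        if caption and lines and lines[-1].strip().upper().startswith("YES"):
            return caption
    # Fall back to the user's original text if verification never passes.
    return text

def generate_audio(text: str, llm: Callable[[str], str], tta_model: Callable[[str], object]):
    """Run the refiner in front of a TTA model without modifying the model."""
    return tta_model(refine_prompt(text, llm))
```

Because both the LLM and the TTA model are passed in as callables, the refiner can be attached to or removed from an existing pipeline without touching the acoustic model, which is the sense of "portable plug-in" assumed here.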

Taxonomy

Topics: Speech and Audio Processing