Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, Jianhua Tao, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang

arXiv:2409.09381 · eess.AS · September 17, 2024

Open Access

TL;DR

This paper introduces a novel sound event enhanced prompt adapter for multi-style audio generation that combines text and audio references, achieving state-of-the-art results and better style control.

Contribution

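The paper proposes an adaptive style transfer method that extracts a style embedding through cross-attention between the text prompt and a reference audio clip and injects it via adaptive layer normalization. It also introduces the Sound Event Reference Style Transfer Dataset (SERST) for dual-prompt audio generation.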

Findings

01. Achieved state-of-the-art Fréchet Distance of 26.94

02. Attained KL Divergence of 1.82

03. Generated audio closely matches reference styles
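
For readers unfamiliar with the two metrics above, the sketch below shows how a Fréchet Distance and a KL Divergence are commonly computed in audio generation evaluation. It is an illustration only: the page does not say which embedding model or classifier the authors used, so the inputs here are random arrays, and the function names `frechet_distance` and `kl_divergence` are hypothetical.

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (num_clips, dim). In practice these would be
    embeddings from a pretrained audio model (assumption: not specified here).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, e.g. classifier outputs."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # eps guards against log(0) when a class has zero probability.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy data standing in for reference and generated embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))
generated = reference + 1.0  # same spread, mean shifted by 1 in each dimension

fd_same = frechet_distance(reference, reference)
fd_shifted = frechet_distance(reference, generated)
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

Lower is better for both metrics: identical sets give a distance near zero, while a pure mean shift contributes the squared norm of the shift.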

Abstract

Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts a style embedding through cross-attention between text and reference audio for adaptive style control. Adaptive layer normalization is then utilized to enhance the model's capacity to express multiple styles. Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references. Experimental results demonstrate the robustness of the model, achieving a state-of-the-art Fréchet Distance of 26.94 and KL Divergence of 1.82, surpassing…
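
The two mechanisms named in the abstract, cross-attention for style extraction and adaptive layer normalization for style injection, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which modality acts as the query, how the attended features are pooled, or how scale and shift are predicted. Here text tokens query the reference-audio frames, the result is mean-pooled, and all weight matrices and dimensions are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_style(text_emb, audio_emb, Wq, Wk, Wv):
    """Assumed design: text tokens attend over reference-audio frames."""
    q = text_emb @ Wq                      # (T_text, d)
    k = audio_emb @ Wk                     # (T_audio, d)
    v = audio_emb @ Wv                     # (T_audio, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)        # each text token weights the audio frames
    attended = attn @ v                    # (T_text, d)
    return attended.mean(axis=0)           # mean-pool to one style vector (assumption)

def adaptive_layer_norm(h, style, W_scale, W_shift, eps=1e-5):
    """Layer norm whose scale and shift are predicted from the style vector."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mean) / np.sqrt(var + eps)
    scale = style @ W_scale                # (d_h,)
    shift = style @ W_shift                # (d_h,)
    return h_norm * (1.0 + scale) + shift

# Toy shapes and random weights, all hypothetical.
rng = np.random.default_rng(0)
d_text, d_audio, d, d_h = 8, 6, 4, 5
text_emb = rng.normal(size=(3, d_text))     # 3 text tokens
audio_emb = rng.normal(size=(10, d_audio))  # 10 reference-audio frames
Wq = rng.normal(size=(d_text, d))
Wk = rng.normal(size=(d_audio, d))
Wv = rng.normal(size=(d_audio, d))

style = cross_attention_style(text_emb, audio_emb, Wq, Wk, Wv)

hidden = rng.normal(size=(7, d_h))          # hidden states of the generator
W_scale = rng.normal(size=(d, d_h)) * 0.1
W_shift = rng.normal(size=(d, d_h)) * 0.1
out = adaptive_layer_norm(hidden, style, W_scale, W_shift)
```

Because the style vector depends on both the text and the reference audio, the normalization parameters change with each prompt pair, which is what distinguishes this from a single static global style embedding.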

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Layer Normalization · Adapter