Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Zixuan Wang; Chi-Keung Tang; Yu-Wing Tai

arXiv:2410.03335 · cs.SD · January 15, 2025

TL;DR

Audio-Agent combines large language models and diffusion networks to enable high-quality, flexible audio generation and editing from text or video inputs, overcoming limitations of single-pass inference methods.

Contribution

The paper introduces a multimodal framework that integrates GPT-4 with a diffusion-based text-to-audio (TTA) network and fine-tuned LLMs for efficient, high-quality audio generation and video-to-audio (VTA) tasks.

Findings

01 High-quality audio generated from complex text conditions

02 Effective video-to-audio synchronization without extensive training

03 Supports variable-length and variable-volume audio outputs

Abstract

We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. In doing so, Audio-Agent can generate high-quality audio that is closely aligned with the provided text or video exhibiting complex and multiple events, while supporting variable-length and variable-volume generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to…
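The decompose-then-generate idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `AtomicInstruction` fields, the use of decibels for volume, the sample rate, and the `stub_tta_agent` function (which returns a sine tone in place of a real TTA diffusion model) are all assumptions made for illustration. The plan is also written by hand here, where the paper would have GPT-4 produce it from the text condition.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # assumed sample rate, not taken from the paper


@dataclass
class AtomicInstruction:
    """One atomic step of a decomposed text condition (hypothetical schema)."""
    prompt: str        # short description of a single sound event
    start_s: float     # where the event starts in the final track
    duration_s: float  # variable-length generation
    volume_db: float   # variable-volume generation (dB is an assumption)


def to_samples(seconds: float) -> int:
    return int(round(seconds * SAMPLE_RATE))


def db_to_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude factor."""
    return 10 ** (db / 20)


def stub_tta_agent(prompt: str, duration_s: float) -> list[float]:
    """Stand-in for the TTA generation agent: a sine tone, NOT a diffusion model.

    The pitch is derived from the prompt so that different prompts sound different.
    """
    freq = 220.0 + 20.0 * (sum(map(ord, prompt)) % 20)
    n = to_samples(duration_s)
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def compose(plan: list[AtomicInstruction]) -> list[float]:
    """Call the agent once per instruction and mix the clips into one track."""
    total = max(to_samples(s.start_s) + to_samples(s.duration_s) for s in plan)
    track = [0.0] * total
    for step in plan:
        clip = stub_tta_agent(step.prompt, step.duration_s)
        gain = db_to_gain(step.volume_db)
        offset = to_samples(step.start_s)
        for i, sample in enumerate(clip):
            track[offset + i] += gain * sample
    return track


# Hand-written plan for "a dog barks twice while rain falls in the background".
plan = [
    AtomicInstruction("steady rain", start_s=0.0, duration_s=4.0, volume_db=-12.0),
    AtomicInstruction("a dog barks", start_s=1.0, duration_s=0.5, volume_db=0.0),
    AtomicInstruction("a dog barks", start_s=2.5, duration_s=0.5, volume_db=0.0),
]
track = compose(plan)
print(f"{len(track) / SAMPLE_RATE:.1f} s of audio from {len(plan)} atomic instructions")
```

Splitting the complex description into separate events means each call to the generator handles one simple prompt, while timing and loudness are controlled outside the generator at the mixing stage.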

Taxonomy

Topics: Digital Rights Management and Security

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings