Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai

TL;DR
Audio-Agent combines large language models and diffusion networks to enable high-quality, flexible audio generation and editing from text or video inputs, overcoming limitations of single-pass inference methods.
Contribution
The paper introduces a multimodal framework that integrates GPT-4 with a diffusion-based text-to-audio (TTA) network and fine-tuned LLMs for efficient, high-quality audio generation, including video-to-audio (VTA) tasks.
Findings
High-quality audio generated from complex text conditions
Effective video-to-audio synchronization without extensive training
Supports variable-length and variable-volume audio outputs
Abstract
We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. In doing so, Audio-Agent can generate high-quality audio that is closely aligned with the provided text or video exhibiting complex and multiple events, while supporting variable-length and variable-volume generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to…
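The decompose-then-generate idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `AtomicInstruction` fields, the hard-coded plan standing in for the LLM planner, and the sine-tone stub standing in for the TTA diffusion agent are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16000  # assumed sample rate for this sketch


@dataclass
class AtomicInstruction:
    """One simple sound event (hypothetical schema, not the paper's)."""
    prompt: str        # short description of a single event
    start_s: float     # start time on the output timeline, in seconds
    duration_s: float  # clip length in seconds (variable-length output)
    volume_db: float   # gain applied to the clip (variable-volume output)


def plan_from_llm(text_condition: str) -> list[AtomicInstruction]:
    """Stand-in for the LLM planner: returns a hard-coded plan.

    A real system would ask the LLM to split `text_condition`
    into atomic instructions; no model is called here.
    """
    return [
        AtomicInstruction("dog barking", 0.0, 2.0, 0.0),
        AtomicInstruction("car passing by", 1.0, 3.0, -6.0),
    ]


def generate_clip(instr: AtomicInstruction) -> list[float]:
    """Stand-in for the TTA agent: emits a sine tone of the requested length."""
    n = int(instr.duration_s * SAMPLE_RATE)
    freq = 220.0 + 20.0 * (len(instr.prompt) % 10)  # arbitrary placeholder pitch
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def compose(plan: list[AtomicInstruction]) -> list[float]:
    """Place each generated clip on a shared timeline and sum the overlaps."""
    total_s = max(p.start_s + p.duration_s for p in plan)
    mix = [0.0] * int(total_s * SAMPLE_RATE)
    for instr in plan:
        gain = 10 ** (instr.volume_db / 20)  # decibels to linear amplitude
        offset = int(instr.start_s * SAMPLE_RATE)
        for i, sample in enumerate(generate_clip(instr)):
            if offset + i < len(mix):  # guard against rounding overshoot
                mix[offset + i] += gain * sample
    return mix


plan = plan_from_llm("A dog barks, then a car passes by")
audio = compose(plan)
```

The point of the sketch is the division of labor: the planner only produces structured, single-event requests, and the generator only ever sees one simple prompt at a time, which is what lets length and volume be controlled per event.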
Taxonomy
Topics: Digital Rights Management and Security
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
