Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo, Yibing Song

TL;DR
This paper introduces an agentic workflow using an LLM-based assistant to iteratively align audio signals with visual data, enhancing audio-visual joint representations for better performance in downstream tasks.
Contribution
It presents a novel agentic workflow that employs multi-modal LLMs for data alignment, incorporating reasoning, editing, and feedback to improve AV representations.
Findings
Achieves state-of-the-art results on multiple AV tasks.
Demonstrates effective noise filtering and synchronization correction.
Enhances downstream application performance through data alignment.
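The reason, edit, feedback loop summarized above can be illustrated with a toy sketch. This is not the paper's AVAgent: the 1-D lists standing in for audio and visual activity, the helper names (`correlation`, `shift`, `smooth`, `ACTIONS`, `align`), the greedy planner standing in for the LLM, and the correlation score standing in for the feedback signal are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Signal = List[float]


def correlation(a: Signal, b: Signal) -> float:
    """Pearson correlation of two equal-length signals (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def shift(audio: Signal, k: int) -> Signal:
    """Delay the audio by k samples (k > 0) or advance it (k < 0), zero-padded."""
    n = len(audio)
    if k >= 0:
        return [0.0] * k + audio[: n - k]
    return audio[-k:] + [0.0] * (-k)


def smooth(audio: Signal) -> Signal:
    """3-tap moving average, a crude stand-in for noise filtering."""
    out = []
    for i in range(len(audio)):
        window = audio[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


# Hypothetical edit actions the planner may choose from.
ACTIONS: Dict[str, Callable[[Signal], Signal]] = {
    "advance_1": lambda a: shift(a, -1),
    "delay_1": lambda a: shift(a, 1),
    "denoise": smooth,
}


def align(audio: Signal, visual: Signal, max_rounds: int = 10) -> Tuple[Signal, float, List[str]]:
    """Repeatedly pick an edit, apply it, and keep it only if the score improves."""
    score = correlation(audio, visual)
    history: List[str] = []
    for _ in range(max_rounds):
        # "Reason": a greedy search stands in for the LLM planner (assumption).
        candidates = [(name, edit(audio)) for name, edit in ACTIONS.items()]
        # "Edit": take the candidate whose result matches the visual signal best.
        name, edited = max(candidates, key=lambda c: correlation(c[1], visual))
        # "Feedback": accept the edit only if alignment strictly improves.
        new_score = correlation(edited, visual)
        if new_score <= score:
            break
        audio, score = edited, new_score
        history.append(name)
    return audio, score, history
```

For example, calling `align` on an audio bump that lags the visual bump by two samples applies `advance_1` twice and then stops, because no remaining edit improves the score.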
Abstract
Visual content and its accompanying audio signals naturally form a joint representation that improves audio-visual (AV) applications. While studies have developed various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually underestimated. We observe that an audio signal may contain background noise interference. Also, the audio and video streams may be out of sync. Such loose data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi-modal LLM to convert audio and visual data into…
Taxonomy
Topics: Music and Audio Processing · Human Motion and Animation · Music Technology and Sound Studies
