Aligning Audio-Visual Joint Representations with an Agentic Workflow

Shentong Mo; Yibing Song

arXiv:2410.23230 · cs.CV · November 1, 2024

TL;DR

This paper introduces an agentic workflow using an LLM-based assistant to iteratively align audio signals with visual data, enhancing audio-visual joint representations for better performance in downstream tasks.

Contribution

It presents a novel agentic workflow that employs multi-modal LLMs for data alignment, incorporating reasoning, editing, and feedback to improve AV representations.

Findings

01 Achieves state-of-the-art results on multiple AV tasks.

02 Demonstrates effective noise filtering and synchronization correction.

03 Enhances downstream application performance through data alignment.

Abstract

Visual content and its accompanying audio signals naturally form a joint representation that can improve audio-visual (AV) applications. While studies have developed various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually underestimated. We observe that an audio signal may contain background noise interference, and that the audio and video streams may be out of sync. Such loose data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi-modal LLM to convert audio and visual data into…

Taxonomy

Topics: Music and Audio Processing · Human Motion and Animation · Music Technology and Sound Studies