MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

Sanjoy Chowdhury; Mohamed Elmoghany; Yohan Abeysinghe; Junjie Fei; Sayan Nag; Salman Khan; Mohamed Elhoseiny; Dinesh Manocha

arXiv:2506.07016·cs.CV·June 17, 2025

Open Access · 1 Video

TL;DR

This paper introduces AVHaystacks, a new large-scale benchmark for audio-visual reasoning across multiple videos, and proposes MAGNET, a multi-agent framework that significantly improves multi-video retrieval and temporal grounding in complex scenarios.

Contribution

The paper presents AVHaystacks, a comprehensive benchmark for multi-video reasoning, and introduces MAGNET, a novel multi-agent framework that enhances audio-visual understanding in large-scale retrieval tasks.

Findings

01. MAGNET achieves up to an 89% relative improvement in BLEU@4 over baseline methods.

02. MAGNET outperforms baseline methods on the AVHaystacks benchmark.

03. Two new metrics, STEM and MTGS, enable better evaluation of multi-video reasoning.

Abstract

Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we…
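To make the task format concrete, the following is a minimal sketch of how one AV-HaystacksQA example and a simple grounding check might be represented. The class and field names (`Segment`, `HaystackQA`, `video_id`, `start`, `end`) and the sample values are hypothetical and are not taken from the benchmark. `temporal_iou` is the generic interval intersection-over-union commonly used for temporal grounding; it is not the paper's STEM or MTGS metric.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Segment:
    """A time span (in seconds) inside one video. Hypothetical schema."""
    video_id: str
    start: float
    end: float


@dataclass
class HaystackQA:
    """One query whose answer draws on segments from several videos."""
    question: str
    answer: str
    segments: List[Segment] = field(default_factory=list)


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Generic interval IoU; 0.0 when the segments come from different videos."""
    if pred.video_id != gold.video_id:
        return 0.0
    inter = max(0.0, min(pred.end, gold.end) - max(pred.start, gold.start))
    union = (pred.end - pred.start) + (gold.end - gold.start) - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only; not drawn from AVHaystacks.
sample = HaystackQA(
    question="How do I tune a guitar by ear?",
    answer="(free-form answer linking both segments)",
    segments=[Segment("video_a", 10.0, 20.0), Segment("video_b", 5.0, 12.0)],
)
prediction = Segment("video_a", 15.0, 25.0)
score = temporal_iou(prediction, sample.segments[0])
```

Here the predicted span overlaps the first ground-truth span for 5 of a combined 15 seconds, so `score` is one third. A prediction in the wrong video scores zero regardless of its timestamps, which reflects why retrieval and grounding have to be evaluated together in a multi-video setting.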

Videos

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization