MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

Yuan Zhao; Zhenqi Jia; Yongqiang Zhang

arXiv:2603.27706·cs.MM·March 31, 2026

TL;DR

MAR3 is a training-free multi-agent framework for reference audio-visual segmentation that explicitly recognizes expression difficulty and the dominant modality and adds reflective validation of the predicted masks, reaching 69.2% J&F on Ref-AVSBench.

Contribution

The paper presents MAR3, a training-free multi-agent framework that improves reference audio-visual segmentation by explicitly modeling expression difficulty and modality dominance and by validating segmentation results through reflection.

Findings

01. Achieves 69.2% J&F on Ref-AVSBench, surpassing SOTA by 3.4%.

02. Introduces a Consensus Multimodal Recognition mechanism for better modality understanding.

03. Develops a Reflective Learning Segmentation mechanism for iterative correction of segmentation masks.

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper, we propose a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework to achieve high-quality Reference Audio-Visual Segmentation, termed MAR3. Incorporating the sociological Delphi theory to achieve robust analysis, a Consensus Multimodal Recognition mechanism is proposed that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal…
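The abstract is cut off before it describes how the Consensus Multimodal Recognition mechanism works, so the following is only a minimal sketch of a generic Delphi-style consensus loop, not the authors' procedure. It assumes a panel of agents that each vote on a label (here, the dominant modality), see an anonymous tally of the previous round, and may revise their vote until an agreement threshold is met. The names `delphi_consensus` and `make_stub_agent`, the 0.75 threshold, the round limit, and the stub agents standing in for LLM calls are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical agent signature: (expression, tally from the previous round) -> label.
Agent = Callable[[str, Optional[Counter]], str]


def delphi_consensus(expression: str, agents: Sequence[Agent],
                     threshold: float = 0.75, max_rounds: int = 3) -> tuple[str, int]:
    """Poll agents in rounds until one label reaches `threshold` agreement."""
    feedback: Optional[Counter] = None
    label = ""
    for round_no in range(1, max_rounds + 1):
        votes = Counter(agent(expression, feedback) for agent in agents)
        label, count = votes.most_common(1)[0]
        if count / len(agents) >= threshold:
            return label, round_no
        feedback = votes  # anonymous tally shared with every agent next round
    return label, max_rounds  # no consensus: fall back to the plurality label


def make_stub_agent(first_guess: str) -> Agent:
    """Stand-in for an LLM call: keeps its own guess first, then follows the majority."""
    def agent(expression: str, feedback: Optional[Counter]) -> str:
        if feedback is None:
            return first_guess
        return feedback.most_common(1)[0][0]
    return agent


# Five stub agents disagree 3-2 in round one and converge in round two.
agents = [make_stub_agent(g) for g in ("audio", "audio", "visual", "visual", "audio")]
result = delphi_consensus("the object making the loudest sound", agents)
print(result)  # ('audio', 2)
```

In a real system each stub would be replaced by a model call that receives the expression and the tally as context. The same loop could vote on expression difficulty instead of modality by changing the label set.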
