Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Jinxing Zhou; Yanghao Zhou; Mingfei Han; Tong Wang; Xiaojun Chang; Hisham Cholakkal; Rao Muhammad Anwer

arXiv:2508.04418·cs.MM·August 7, 2025

TL;DR

This paper introduces TGS-Agent, an approach to referring audio-visual segmentation that mimics human reasoning: it first identifies the referred object through explicit multimodal analysis, then grounds and segments it, improving segmentation accuracy without pixel-level supervision.

Contribution

The paper proposes an explicit reasoning framework built around Ref-Thinker, a multimodal language model, and a new benchmark, R2-AVSBench, for evaluating reasoning-intensive referring AVS tasks.

Findings

01 Achieves state-of-the-art results on Ref-AVSBench

02 Introduces a new benchmark R2-AVSBench with diverse references

03 Demonstrates effectiveness of explicit reasoning over latent embedding methods

Abstract

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker…
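
The Think-Ground-Segment decomposition described in the abstract can be pictured as three stages chained together. The following is a minimal sketch of that control flow only, not the authors' implementation: the `Thought` type, the `think_ground_segment` function, and the three `toy_*` stubs are hypothetical names introduced for illustration. In a real system the think stage would be a multimodal language model such as Ref-Thinker, and the ground and segment stages would be actual grounding and segmentation models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[List[int]]          # toy grayscale frame: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive (assumed convention)
Mask = List[List[int]]           # binary mask with the same size as the frame


@dataclass
class Thought:
    """Hypothetical container for the output of the think stage."""
    reasoning: str           # free-form reasoning text
    object_description: str  # explicit description of the referred object


def think_ground_segment(
    frames: Sequence[Frame],
    audio: Sequence[float],
    reference: str,
    think: Callable[[Sequence[Frame], Sequence[float], str], Thought],
    ground: Callable[[Frame, str], List[Box]],
    segment: Callable[[Frame, List[Box]], Mask],
) -> Tuple[Thought, List[Mask]]:
    """Run the three stages in order: think once, then ground and segment per frame."""
    thought = think(frames, audio, reference)
    masks = []
    for frame in frames:
        # Coarse-grained grounding from the explicit object description.
        boxes = ground(frame, thought.object_description)
        # Segmentation prompted by the grounded boxes.
        masks.append(segment(frame, boxes))
    return thought, masks


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_think(frames, audio, reference):
    return Thought(
        reasoning=f"The reference '{reference}' points to the sounding instrument.",
        object_description="guitar",
    )


def toy_ground(frame, description):
    # A real grounding model would localize `description`; this returns a fixed box.
    return [(1, 1, 3, 3)]


def toy_segment(frame, boxes):
    # A real segmenter would refine each box into a precise mask; this fills the box.
    height, width = len(frame), len(frame[0])
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                mask[y][x] = 1
    return mask


frames = [[[0] * 4 for _ in range(4)] for _ in range(2)]  # two 4x4 toy frames
audio = [0.0] * 8                                         # placeholder waveform
thought, masks = think_ground_segment(
    frames, audio, "the object making the sound",
    toy_think, toy_ground, toy_segment,
)
```

Passing the stages in as callables keeps the interface between them explicit: the only thing handed from the think stage to the later stages is a human-readable object description, which is the source of the interpretability the abstract contrasts with latent-embedding prompts.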
