ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation
Yihao Wang, Jusheng Zhang, Ziyi Tang, Keze Wang, Meng Yang

TL;DR
ResAgent is a referring expression segmentation framework that pairs entropy-based point discovery with vision-based reasoning: it selects informative point prompts and validates them visually, reaching state-of-the-art results on four benchmark datasets.
Contribution
The paper proposes ResAgent, a novel framework combining entropy-based point discovery and visual reasoning to enhance segmentation precision over existing methods.
Findings
Achieves state-of-the-art results on four benchmark datasets.
Effectively identifies high-information points within coarse bounding boxes.
Demonstrates robustness by reducing reliance on textual coordinate reasoning.
Abstract
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose ResAgent, a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). Specifically, EBD…
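The abstract is cut off before it explains how EBD works, so the following is only a minimal sketch of the general idea of scoring candidate points inside a coarse bounding box by entropy. It is not the authors' method. It assumes a hypothetical per-pixel foreground probability map (`prob_map`), uses Shannon binary entropy as the score, and the function names `binary_entropy` and `top_entropy_points` are invented for illustration.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in bits) of a Bernoulli probability, elementwise."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log2(0)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))


def top_entropy_points(prob_map, box, k=3):
    """Return the k (x, y) pixels inside `box` with the highest entropy.

    prob_map: 2-D array of per-pixel foreground probabilities (an assumption;
              the source does not say what EBD computes entropy over).
    box:      (x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    crop = binary_entropy(prob_map[y0:y1, x0:x1])
    # Sort all pixels of the crop by entropy, highest first, and keep k.
    flat = np.argsort(crop, axis=None)[::-1][:k]
    ys, xs = np.unravel_index(flat, crop.shape)
    # Shift crop-local indices back to image coordinates.
    return [(int(x) + x0, int(y) + y0) for x, y in zip(xs, ys)]


# Toy example: a confident map with two uncertain pixels inside the box.
prob_map = np.full((6, 6), 0.99)
prob_map[2, 3] = 0.5   # row 2, column 3: maximally uncertain
prob_map[4, 1] = 0.4   # row 4, column 1: slightly less uncertain
points = top_entropy_points(prob_map, box=(1, 1, 5, 5), k=2)
```

In this toy setup the two selected points are the two pixels whose probabilities are closest to 0.5. Whether the real EBD favors high-entropy pixels, uses a different distribution, or adds further filtering is not stated in this excerpt.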
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
