Vision Harnessing Agent for Open Ad-hoc Segmentation

Zilin Wang; Stella X. Yu

arXiv:2605.19410 · cs.CV · May 20, 2026

TL;DR

VASA is a training-free, vision-guided agent for open ad-hoc segmentation. It reasons about, constructs, and validates a solution on a persistent working mask, coupling a VLM agent with a segmentation foundation model.

Contribution

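The paper introduces VASA, which the authors describe as the first vision harnessing agent for open ad-hoc segmentation. VASA couples a VLM agent, a segmentation foundation model, and a visually grounded workflow so that a mask can be constructed through reasoning rather than retrieved from a single text prompt.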

Findings

01. VASA outperforms existing baselines on the PARS and RefCOCOm benchmarks.

02. VASA surpasses SAM3 Agent by 14-25% on PARS.

03. VASA improves over other agentic baselines by up to 20% on RefCOCOm.

Abstract

Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc…
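
The idea of a persistent working mask that is edited step by step can be illustrated with a small sketch. This is not the authors' implementation: the `WorkingMask` class, its `add`, `exclude`, `keep_only`, and `undo` methods, and the `box_mask` function standing in for a segmentation tool call are all hypothetical names chosen for illustration. The sketch only shows how a concept such as "the body without the head" might be constructed from two simpler masks, with a history that allows recovery from a bad edit.

```python
import numpy as np


class WorkingMask:
    """Persistent boolean mask edited in place (hypothetical helper, not from the paper)."""

    def __init__(self, height, width):
        self.mask = np.zeros((height, width), dtype=bool)
        self.history = []  # earlier snapshots, used to recover from a bad edit

    def _apply(self, new_mask):
        # Save the current state before every edit so it can be undone.
        self.history.append(self.mask.copy())
        self.mask = new_mask

    def add(self, region):
        """Union: grow the working mask by a region (parts, collections)."""
        self._apply(self.mask | region)

    def exclude(self, region):
        """Difference: remove a region from the working mask (exclusions)."""
        self._apply(self.mask & ~region)

    def keep_only(self, region):
        """Intersection: restrict the working mask to a region (relations)."""
        self._apply(self.mask & region)

    def undo(self):
        """Roll back the most recent edit, if there is one."""
        if self.history:
            self.mask = self.history.pop()


def box_mask(height, width, top, left, bottom, right):
    """Assumed stand-in for a segmentation tool: returns a rectangular mask."""
    m = np.zeros((height, width), dtype=bool)
    m[top:bottom, left:right] = True
    return m


# Toy example: build "the body without the head" on an 8x8 image.
wm = WorkingMask(8, 8)
body = box_mask(8, 8, 2, 1, 6, 7)  # 4 rows x 6 columns
head = box_mask(8, 8, 2, 5, 4, 7)  # 2 rows x 2 columns, inside the body
wm.add(body)
wm.exclude(head)
area_after_exclude = int(wm.mask.sum())

# If inspection shows the exclusion was a mistake, roll it back.
wm.undo()
area_after_undo = int(wm.mask.sum())
```

In an agentic setting, each call to `add`, `exclude`, or `keep_only` would presumably follow a planning step and be followed by an inspection step; the sketch leaves those out and keeps only the mask bookkeeping.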
