Vision Harnessing Agent for Open Ad-hoc Segmentation
Zilin Wang, Stella X. Yu

TL;DR
VASA is a training-free, vision-guided agent for open ad-hoc segmentation. It reasons about, constructs, and validates visual solutions using a persistent working mask and multiple models.
Contribution
The paper introduces VASA, which the authors describe as the first vision harnessing agent for open ad-hoc segmentation. It couples a VLM agent, a segmentation foundation model, and a visually grounded workflow to construct masks through reasoning.
Findings
VASA outperforms existing baselines on the PARS and RefCOCOm benchmarks.
VASA surpasses SAM3 Agent by 14-25% on PARS.
VASA improves over other agentic baselines by up to 20% on RefCOCOm.
Abstract
Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc…
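The idea of a persistent working mask that is edited, inspected, and rolled back can be illustrated with a small sketch. This is not the authors' implementation: the `WorkingMask` class, its method names, and the `rect` helper are hypothetical, the mask is stored as a plain set of pixel coordinates, and the rectangles stand in for whatever regions a segmentation tool might return.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def rect(r0: int, c0: int, r1: int, c1: int) -> Set[Pixel]:
    """Hypothetical stand-in for a segmentation tool: a filled rectangle."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


@dataclass
class WorkingMask:
    """Hypothetical persistent mask with an edit history for recovery."""
    pixels: FrozenSet[Pixel] = frozenset()
    history: List[FrozenSet[Pixel]] = field(default_factory=list)

    def _apply(self, new_pixels) -> None:
        # Save the previous state so a bad edit can be undone.
        self.history.append(self.pixels)
        self.pixels = frozenset(new_pixels)

    def add(self, region: Set[Pixel]) -> None:
        """Collect a part or instance into the mask (set union)."""
        self._apply(self.pixels | region)

    def exclude(self, region: Set[Pixel]) -> None:
        """Carve a region out of the mask (set difference)."""
        self._apply(self.pixels - region)

    def restrict(self, region: Set[Pixel]) -> None:
        """Keep only the overlap with a region (set intersection)."""
        self._apply(self.pixels & region)

    def undo(self) -> None:
        """Recover from an error by restoring the previous state."""
        if self.history:
            self.pixels = self.history.pop()


# Toy query: "the animal except its head".
body = rect(0, 0, 4, 6)   # assumed whole-object region
head = rect(0, 4, 2, 6)   # assumed part region inside the body

mask = WorkingMask()
mask.add(body)
mask.exclude(head)

# A mistaken step: restricting to a region that does not overlap the object.
mask.restrict(rect(10, 10, 12, 12))
if not mask.pixels:       # simple validation check: the mask became empty
    mask.undo()           # roll back to the last good state
```

In this toy setup the validation step is only an emptiness check. In the workflow the abstract describes, inspection is done by the VLM agent looking at the result, which this sketch does not attempt to model.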
