TL;DR
SAM3-I extends the Segment Anything Model to interpret complex natural-language instructions for segmentation, integrating concept grounding and instruction reasoning within a unified framework.
Contribution
The paper introduces SAM3-I, a unified segmentation framework that directly interprets natural-language instructions, and HMPL-Instruct, a large-scale instruction-centric dataset for training.
Findings
SAM3-I achieves strong performance in instruction-following segmentation tasks.
The model effectively combines concept grounding with complex instruction understanding.
The authors state that the code and dataset are publicly available on GitHub.
Abstract
Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation…
