TL;DR
X-SAM extends the Segment Anything Model (SAM) into a unified multimodal large language model framework that handles any segmentation task. It introduces the Visual GrounDed (VGD) segmentation task and a co-training strategy across diverse datasets, and reports state-of-the-art results on multiple segmentation benchmarks.
Contribution
The paper proposes X-SAM, a novel multimodal large language model framework that unifies various segmentation tasks and introduces VGD segmentation for enhanced pixel-level comprehension.
Findings
Achieves state-of-the-art performance on multiple segmentation benchmarks.
Effectively co-trains across diverse datasets for improved generalization.
Demonstrates efficient multimodal, pixel-level visual understanding.
Abstract
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD)…
