X-SAM: From Segment Anything to Any Segmentation

Hao Wang; Limeng Qiao; Zequn Jie; Zhijian Huang; Chengjian Feng; Qingfang Zheng; Lin Ma; Xiangyuan Lan; Xiaodan Liang

arXiv:2508.04655 · cs.CV · January 29, 2026


TL;DR

X-SAM extends the Segment Anything Model (SAM) into a unified multimodal large language model framework that handles any segmentation task. It introduces the Visual GrounDed (VGD) segmentation task and a co-training strategy across diverse datasets, and reports state-of-the-art results in pixel-level visual understanding.

Contribution

The paper proposes X-SAM, a novel multimodal large language model framework that unifies various segmentation tasks and introduces VGD segmentation for enhanced pixel-level comprehension.

Findings

01. Achieves state-of-the-art performance on multiple segmentation benchmarks.

02. Effectively co-trains across diverse datasets for improved generalization.

03. Demonstrates efficient multimodal, pixel-level visual understanding.

Abstract

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD)…


Code & Models

Models

🤗 hao9610/X-SAM (model)
