X2SAM: Any Segmentation in Images and Videos

Hao Wang; Limeng Qiao; Chi Zhang; Lin Ma; Guanglu Wan; Xiangyuan Lan; Xiaodan Liang

arXiv:2605.00891·cs.CV·May 5, 2026

TL;DR

X2SAM is a unified model that extends segmentation capabilities from images to videos, supporting complex conversational instructions and visual prompts for diverse segmentation tasks.

Contribution

The paper introduces X2SAM, a multimodal large language model that unifies image and video segmentation, along with a joint training strategy and a new benchmark.

Findings

01. X2SAM achieves strong video segmentation performance.

02. It remains competitive on image segmentation benchmarks.

03. The model supports diverse interactive and grounded segmentation tasks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic,…
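The abstract mentions a Mask Memory module that stores guided vision features so that video masks stay temporally consistent, but it does not describe the design. As a rough illustration of the general idea only, the sketch below keeps a bounded buffer of per-frame feature vectors and reads it back with softmax-weighted dot-product attention. The class `FeatureMemory`, its methods, and the fixed-capacity FIFO policy are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math
from collections import deque


class FeatureMemory:
    """Hypothetical bounded memory of per-frame features (not X2SAM's actual module)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest frame once capacity is reached.
        self.frames = deque(maxlen=capacity)

    def write(self, feature):
        """Store a copy of the feature vector for the current frame."""
        self.frames.append(list(feature))

    def read(self, query):
        """Return a softmax(dot-product)-weighted average of the stored features."""
        if not self.frames:
            return list(query)  # nothing stored yet: fall back to the query itself
        scores = [sum(q * f for q, f in zip(query, feat)) for feat in self.frames]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * feat[i] for w, feat in zip(weights, self.frames)) / total
            for i in range(len(query))
        ]


memory = FeatureMemory(capacity=2)
for frame_feature in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    memory.write(frame_feature)
context = memory.read([0.0, 0.0])  # equal scores, so a plain average of the last two frames
```

In a real video pipeline the stored entries would be feature maps rather than short vectors, and the readout would condition mask decoding for the next frame; how X2SAM does this is not stated in the text above.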

Code & Models

Repositories

wanghao9610/X2SAM (GitHub)
