ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

Xiaolong Wang; Lixiang Ru; Ziyuan Huang; Kaixiang Ji; Dandan Zheng; Jingdong Chen; Jun Zhou

arXiv:2510.20803 · cs.CV · October 24, 2025

TL;DR

ARGenSeg introduces an image segmentation approach that uses autoregressive image generation within multimodal large language models (MLLMs), enabling dense pixel-level understanding and faster inference than prior methods.

Contribution

The paper presents a unified segmentation framework based on image generation: an MLLM outputs visual tokens, and a VQ-VAE detokenizes them into dense masks, improving speed and accuracy over existing techniques.

Findings

01. Outperforms prior state-of-the-art on multiple datasets

02. Achieves faster inference with parallel visual token generation

03. Maintains strong visual understanding capabilities

Abstract

We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce…
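The detokenization step described in the abstract (visual tokens mapped back to an image, from which a dense mask is read) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `detokenize_to_mask` and `toy_decoder`, the two-entry codebook, the nearest-neighbour upsampling, and the 0.5 threshold are all assumptions made for illustration, and a real VQ-VAE decoder would be a learned network.

```python
import numpy as np


def detokenize_to_mask(token_ids, codebook, grid_hw, decode_fn, threshold=0.5):
    """Turn discrete visual tokens into a binary mask (illustrative only).

    token_ids: sequence of H*W codebook indices, in row-major order.
    codebook:  (K, D) array of code vectors (hypothetical VQ codebook).
    grid_hw:   (H, W) shape of the token grid.
    decode_fn: callable mapping an (H, W, D) latent grid to a 2-D image
               with values in [0, 1]; stands in for a VQ-VAE decoder.
    """
    h, w = grid_hw
    token_ids = np.asarray(token_ids)
    if token_ids.shape != (h * w,):
        raise ValueError("expected exactly H*W token ids")
    # Codebook lookup: each index selects one D-dimensional code vector.
    latents = codebook[token_ids].reshape(h, w, -1)
    # Decode the latent grid to pixel space, then binarise it.
    image = decode_fn(latents)
    return image >= threshold


def toy_decoder(latents, scale=2):
    """Assumed stand-in decoder: channel mean, then nearest-neighbour upsample."""
    intensity = latents.mean(axis=-1)
    return np.kron(intensity, np.ones((scale, scale)))


# Hypothetical two-entry codebook: code 0 = background, code 1 = foreground.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
mask = detokenize_to_mask([0, 1, 1, 0], codebook, (2, 2), toy_decoder)
print(mask.astype(int))
```

In this toy setup a 2×2 token grid is expanded to a 4×4 mask; the point is only that the mask is obtained entirely from the generated tokens, with no task-specific segmentation head.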

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning