TL;DR
LlamaSeg is an autoregressive, Transformer-based framework for image segmentation driven by natural language instructions. It supports flexible, open-vocabulary, fine-grained mask generation and proposes a new composite metric for assessing mask quality.
Contribution
It reformulates image segmentation as a visual generation task using a LLaMA-style Transformer, introduces a large-scale dataset with diverse annotations, and proposes a new composite metric for mask quality assessment.
Findings
Outperforms existing generative models on multiple datasets.
Enables object localization based on text prompts.
Produces more detailed and accurate segmentation masks.
Abstract
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of…
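The next-token formulation described in the abstract can be illustrated with a minimal greedy decoding loop. This is only a sketch under stated assumptions, not the authors' implementation: `generate_mask_tokens`, `toy_logits`, the vocabulary size of 8, and the prefix token ids are all hypothetical, and `toy_logits` is a random stand-in for a trained LLaMA-style Transformer.

```python
import random
from typing import Callable, List


def generate_mask_tokens(
    prefix: List[int],
    next_token_logits: Callable[[List[int]], List[float]],
    num_mask_tokens: int,
) -> List[int]:
    """Greedy next-token decoding: append one mask token per step."""
    sequence = list(prefix)
    for _ in range(num_mask_tokens):
        # Score every vocabulary entry given everything generated so far.
        logits = next_token_logits(sequence)
        # Pick the highest-scoring token id (greedy choice).
        next_id = max(range(len(logits)), key=logits.__getitem__)
        sequence.append(next_id)
    # Return only the newly generated mask tokens, not the conditioning prefix.
    return sequence[len(prefix):]


def toy_logits(sequence: List[int], vocab_size: int = 8) -> List[float]:
    """Hypothetical stand-in for a trained model: deterministic pseudo-scores."""
    rng = random.Random(sum(sequence) + len(sequence))
    return [rng.random() for _ in range(vocab_size)]


# Assumed prefix: token ids standing in for the encoded image and text prompt.
prefix = [3, 1, 4, 1, 5]
mask_tokens = generate_mask_tokens(prefix, toy_logits, num_mask_tokens=16)
print(mask_tokens)
```

In a real system the generated ids would be passed to a mask tokenizer's decoder to recover a pixel-level mask; that step is omitted here because the excerpt does not describe it.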
Peer Reviews
Decision·Submitted to ICLR 2026
### 1. Conceptual novelty
Reformulating image segmentation as an autoregressive mask generation problem is a creative and elegant extension of large language model principles to pixel-level prediction. This perspective bridges the gap between generative modeling and structured visual understanding.
### 2. Unified framework
The proposed approach enables seamless integration of segmentation tasks into LLM-based architectures through a consistent tokenization and generation pipeline. It
### 1. Limited scope and contribution
The contribution feels more foundational within a narrow scope rather than broadly transformative. The method primarily focuses on segmentation and language alignment, without clear extensions to other modalities or tasks such as vision-language reasoning, instruction following, or general multimodal generation. Compared with highly integrative multimodal frameworks like Unified-IO, 4M-21 (Bachmann, Roman, et al. "4m-21: An any-to-any vision model for ten
1. The proposed method offers a unified formulation for multiple segmentation tasks (semantic, referring, and open-vocabulary) in one autoregressive model.
2. The proposed method shows strong boundary fidelity, which results from the mask tokenizer and autoregressive decoding.
3. The new SA-OVRS dataset is large and provides open-vocabulary supervision, which improves performance on multiple tasks.
1. The proposed method performs worse than discriminative models on some tasks.
2. The mask tokens use a fixed ×16 downsampling factor, which can miss fine details.
3. Autoregressive decoding introduces latency and is much slower than discriminative models.
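The link between the reviewer's points about ×16 downsampling and decoding latency can be made concrete with a small calculation. The sketch below assumes a 512×512 input, which is not stated in this excerpt, and `mask_token_count` is a hypothetical helper. In plain autoregressive decoding each token takes one sequential step, so the token count roughly tracks the number of forward passes per mask.

```python
def mask_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Number of mask tokens for an image when each token covers a
    downsample x downsample pixel patch."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (height // downsample) * (width // downsample)


# Assumed resolution for illustration only.
tokens = mask_token_count(512, 512)
print(tokens)  # 32 * 32 = 1024 tokens, i.e. 1024 sequential decoding steps
```

Each token summarizes a 16×16 pixel patch, which may explain why structures much thinner than 16 pixels are hard to represent exactly.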
1. LlamaSeg introduces the idea of using an image tokenizer to encode segmentation masks, effectively unifying various segmentation tasks within a discrete autoregressive framework.
2. The paper is clearly written and easy to follow.
1. The comparison baselines are outdated, and LlamaSeg’s segmentation performance is not competitive (e.g., around 56 on RefCOCO), which is significantly lower than recent methods such as Ferret-v2 [1] (≈90).
2. The relatively poor performance raises doubts about whether encoding masks using image tokenizer truly offers advantages over encoding them as discrete position tokens or point sequences. A more detailed comparison and ablation studies (including performance and efficiency) across diffe
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
