Towards Pixel-Level VLM Perception via Simple Points Prediction
Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu, Zaida Zhou, Zhiqi Huang, Yiping Bao, Y.Charles, Xinyu Zhou, Limin Wang

TL;DR
SimpleSeg demonstrates that Multimodal Large Language Models (MLLMs) can achieve pixel-level perception by predicting object boundary points directly in language space, using a simple two-stage training pipeline and no specialized architecture.
Contribution
The paper introduces a novel point prediction approach for segmentation within MLLMs, showing that low-level spatial understanding can emerge from simple sequence generation.
Findings
Achieves segmentation performance comparable to, and often surpassing, methods that rely on complex, task-specific designs.
Reveals inherent low-level perception capacity in standard MLLM architectures.
Simplifies segmentation by eliminating complex task-specific components.
Abstract
We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SF→RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This…
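To make the IoU-based reward concrete, the following is a minimal sketch of how a predicted point sequence could be scored against a ground-truth mask. It is not the authors' implementation. The text format of the points (bracketed pixel coordinates), the even-odd rasterization at pixel centers, and all function names (`parse_points`, `rasterize`, `iou_reward`) are assumptions made for illustration; the abstract only states that the reward is IoU-based.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]
Mask = List[List[bool]]


def parse_points(text: str) -> List[Point]:
    """Pull (x, y) pairs out of generated text such as "[[2, 2], [6, 2]]".

    Assumed format: plain numbers in pixel coordinates, x before y.
    """
    nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
    return list(zip(nums[0::2], nums[1::2]))


def point_in_polygon(x: float, y: float, poly: List[Point]) -> bool:
    """Even-odd rule: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(poly: List[Point], width: int, height: int) -> Mask:
    """Fill the polygon into a boolean mask by testing each pixel center."""
    if len(poly) < 3:  # fewer than three points cannot enclose any area
        return [[False] * width for _ in range(height)]
    return [
        [point_in_polygon(c + 0.5, r + 0.5, poly) for c in range(width)]
        for r in range(height)
    ]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized boolean masks."""
    pairs = [(bool(pa), bool(pb)) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    inter = sum(pa and pb for pa, pb in pairs)
    union = sum(pa or pb for pa, pb in pairs)
    return inter / union if union else 0.0


def iou_reward(pred_text: str, gt_mask: Mask) -> float:
    """Hypothetical reward: IoU between the rasterized prediction and the ground truth."""
    height, width = len(gt_mask), len(gt_mask[0])
    pred_mask = rasterize(parse_points(pred_text), width, height)
    return iou(pred_mask, gt_mask)


if __name__ == "__main__":
    gt = rasterize([(2, 2), (6, 2), (6, 6), (2, 6)], 8, 8)
    print(iou_reward("[[2, 2], [6, 2], [6, 6], [2, 6]]", gt))  # exact match
    print(iou_reward("[[2, 2], [4, 2], [4, 6], [2, 6]]", gt))  # left half only
```

In this sketch, output that parses to fewer than three points yields an empty mask and therefore a reward of zero, which is one simple way to penalize malformed sequences; whether the paper handles such cases this way is not stated in the abstract.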
Peer Reviews
Decision · Submitted to ICLR 2026
1. Reframing segmentation as point-sequence generation within language space is a clear, novel, and elegant idea that aligns with the unified philosophy of MLLMs. 2. Despite its simplicity, SimpleSeg performs comparably or better than more complex models, demonstrating that architectural minimalism can achieve strong pixel-level perception.
1. While the concept is elegant, the technical novelty may be seen as moderate; the method largely combines known ideas (polygon prediction + RL optimization). 2. There are also other decoder-free projects, such as GiT[1] and UFO[2]. It would be better to include some discussion. 3. The evaluation focuses mainly on referring expression segmentation benchmarks; more diverse real-world datasets (e.g., COCO panoptic or open-domain segmentation) would strengthen generalization claims. [1] "Git: To
1) Strong results on segmentation benchmarks. 2) Significance of task (wide range of practical applications) 3) Simplicity (easy to read and follow paper and implementation details, except points mentioned in Weaknesses (Presentation)). Although I believe the central claim of the paper is not sufficiently supported by the evidence, given that the results can be applied to various practical applications (such as controllable image editing, vision-based tool use, and GUI-grounded agents), I consi
1. The central claim (L92-93) "we demonstrate that standard MLLM architectures possess a strong, inherent capacity for fine-grained perception" is not sufficiently supported by evidence, as the approach is validated only on one VLM architecture -- Kimi-VL. This claim might hold for other architectures, but it was not tested in this work. 2. Some components are not presented clearly or explanation is missing (Presentation): - L268-266: Not clear what "large scale web data" means in this context,
**Originality & Significance:** - The authors do a great job in outlining why the task matters, as well as how it is approached; all in all a very clear motivation of the presented research - Simple yet elegant approach that requires no architectural modification of MLLMs to solve (referring expression) segmentation and comprehension - Good results on the benchmarks, even in the context of architecturally-tailored approaches **Quality:** - The work is placed well within related efforts, an
- **Justification for use of RL is lacking** and incomplete (imo even misleading), see questions. - **Several claims are made in a very broad manner** (mainly beyond the focus of this work), and would benefit from being toned-down a bit as they feel overstated (and are not necessarily being substantiated in the paper): - E.g. l. 170 *'…can be seamlessly […] integrated as a new, core pre-training task for foundation models'* -> Note that this is quite a big claim, as the authors train/fine-tu
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
