Text4Seg: Reimagining Image Segmentation as Text Generation
Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

TL;DR
Text4Seg casts image segmentation as text generation, which simplifies integration with multimodal large language models. Semantic descriptors and row-wise run-length encoding (R-RLE) give it competitive segmentation performance with shorter outputs and faster inference.
Contribution
It presents a novel text-as-mask paradigm with semantic descriptors and R-RLE encoding, enabling effective and efficient image segmentation within MLLMs without extra decoders.
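The row-wise run-length idea can be illustrated with a minimal sketch. The function names, the `label *count` run format, and the ` | ` and newline separators are assumptions chosen for illustration; the exact text format used by Text4Seg is not given in this summary.

```python
from itertools import groupby


def rrle_encode(grid, row_sep="\n", item_sep=" | "):
    """Run-length encode each row of a 2D grid of patch labels.

    Runs never cross a row boundary, so every encoded row can be
    decoded on its own. The textual format is an assumption.
    """
    rows = []
    for row in grid:
        # groupby merges consecutive identical labels within one row.
        runs = [f"{label} *{len(list(group))}" for label, group in groupby(row)]
        rows.append(item_sep.join(runs))
    return row_sep.join(rows)


def rrle_decode(text, row_sep="\n", item_sep=" | "):
    """Invert rrle_encode back into a 2D grid of labels."""
    grid = []
    for line in text.split(row_sep):
        row = []
        for run in line.split(item_sep):
            label, count = run.rsplit(" *", 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


# A toy 2x4 grid of patch labels (hypothetical example data).
grid = [
    ["others", "others", "dog", "dog"],
    ["others", "dog", "dog", "dog"],
]
encoded = rrle_encode(grid)
print(encoded)
```

Restricting runs to a single row keeps the 2D layout explicit in the text, at the cost of slightly longer output than a run-length encoding over the flattened grid.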
Findings
Achieves state-of-the-art results on multiple datasets.
Reduces semantic descriptor length by 74%.
Speeds up inference by 3 times.
Abstract
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the…
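As a rough illustration of the patch-to-label mapping described in the abstract, the sketch below turns a binary mask into a grid of text labels. The function name `mask_to_descriptors`, the majority-vote rule for labeling a patch, and the `others` background label as a default are assumptions for illustration, not the authors' implementation.

```python
def mask_to_descriptors(mask, grid_size, target_label, background_label="others"):
    """Map a binary mask (list of rows of 0/1) to a grid_size x grid_size
    grid of text labels, one label per image patch.

    Assumption: a patch takes the target label when more than half of
    its pixels are foreground. Mask height and width are assumed to be
    divisible by grid_size.
    """
    height, width = len(mask), len(mask[0])
    patch_h, patch_w = height // grid_size, width // grid_size
    descriptors = []
    for gy in range(grid_size):
        row = []
        for gx in range(grid_size):
            # Count foreground pixels inside this patch.
            foreground = sum(
                mask[y][x]
                for y in range(gy * patch_h, (gy + 1) * patch_h)
                for x in range(gx * patch_w, (gx + 1) * patch_w)
            )
            is_target = 2 * foreground > patch_h * patch_w
            row.append(target_label if is_target else background_label)
        descriptors.append(row)
    return descriptors


# A toy 4x4 mask split into a 2x2 patch grid (hypothetical example data).
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
descriptors = mask_to_descriptors(mask, grid_size=2, target_label="dog")
print(descriptors)
```

Because the result is plain text labels in reading order, it can be serialized into an ordinary token sequence and trained with the usual next-token objective.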
Peer Reviews
Decision·ICLR 2025 Poster
The authors have conducted extensive experiments and ablation studies to demonstrate the effectiveness of their proposed method.
The paper presents a method that integrates image segmentation into MLLMs by introducing semantic descriptors and utilizing a SAM mask refiner. While the approach simplifies the segmentation process by treating it as a text generation task, the technical contributions appear to be more incremental and engineering-oriented. The method essentially adapts existing MLLMs with semantic descriptors to perform segmentation tasks, serving as a baseline framework that can be applied to different MLLM mod
- models pixel labels as semantic text tokens with a generalized VLLM
- compresses token length with row-wise run-length encoding
- comprehensive results on referring expression segmentation and comprehension
- idea is simple and straightforward
- relies on SAM to get the final pixel-level prediction from the patch-level semantic text token prediction
- should compare with other VLLM approaches that also use SAM refinement
Originality - Authors propose a new formulation of mask predictions by MLLMs. Different from previous work that represents masks as a special <SEG> token or coordinates, this work formulates the mask as a sequence of text labels (called *semantic descriptors*) of 'others' and the queried target label. Such formulation allows the MLLM to be trained for segmentation with the LLM's original autoregressive training objective, allowing easier optimization and the maintenance of the architecture. Alth
- Although the authors stress throughout the paper that a main advantage of their mask formulation is that it does not require a segmentation decoder, Text4Seg uses SAM to reach performance comparable to other generalist segmentation models. More precisely, Text4Seg *does* require a segmentation decoder but *does not* require finetuning it. Thus, instead of saying that Text4Seg is a decoder-free framework, the authors should explain that it is a decoder-training-free framework.
- Since the authors say that SAM is
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing
