Text4Seg: Reimagining Image Segmentation as Text Generation

Mengcheng Lan; Chaofeng Chen; Yue Zhou; Jiaxing Xu; Yiping Ke; Xinjiang Wang; Litong Feng; Wayne Zhang

arXiv:2410.09855·cs.CV·February 18, 2025

TL;DR

Text4Seg introduces a text-based image segmentation method that simplifies integration with multimodal large language models, achieving competitive performance and efficiency improvements through semantic descriptors and encoding techniques.

Contribution

It presents a novel text-as-mask paradigm with semantic descriptors and R-RLE encoding, enabling effective and efficient image segmentation within MLLMs without extra decoders.
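The row-wise run-length idea behind R-RLE can be illustrated with a minimal sketch. It assumes a grid of per-patch text labels and collapses consecutive identical labels within each row into a single `label*count` run. The function names and the delimiters (`*`, `, `, newline) are assumptions made for illustration, not the paper's actual output format, and labels are assumed not to contain the run separator.

```python
from itertools import groupby


def rrle_encode(grid, count_sep="*", run_sep=", ", row_sep="\n"):
    """Row-wise run-length encode a 2D grid of text labels.

    Runs never cross row boundaries, so each row is compressed on its own.
    The delimiters are illustrative assumptions.
    """
    encoded_rows = []
    for row in grid:
        runs = [
            f"{label}{count_sep}{sum(1 for _ in group)}"
            for label, group in groupby(row)
        ]
        encoded_rows.append(run_sep.join(runs))
    return row_sep.join(encoded_rows)


def rrle_decode(text, count_sep="*", run_sep=", ", row_sep="\n"):
    """Invert rrle_encode, restoring the full grid of labels."""
    grid = []
    for line in text.split(row_sep):
        row = []
        for run in line.split(run_sep):
            # Split on the last separator so labels may contain spaces.
            label, count = run.rsplit(count_sep, 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


grid = [
    ["others", "others", "dog"],
    ["dog", "dog", "dog"],
]
encoded = rrle_encode(grid)
restored = rrle_decode(encoded)
```

Because runs are kept within rows, the encoding preserves the two-dimensional layout of the grid: a uniform 16 × 16 grid of 256 labels becomes 16 runs, one per row.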

Findings

01

Achieves state-of-the-art results on multiple datasets.

02

Reduces semantic descriptor length by 74%.

03

Speeds up inference by a factor of 3.

Abstract

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16 \times 16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the…
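As a rough illustration of the patch-to-label mapping described in the abstract, the sketch below converts a binary pixel mask into a grid of text labels. The function name, the majority-vote rule for labeling a patch, and the use of nested lists are assumptions made for this example; only the idea of one label per patch and the `others` background label come from the material on this page. The mask is assumed to be at least `grid` pixels in each dimension.

```python
def mask_to_descriptors(mask, label, grid=16, background="others"):
    """Map a binary pixel mask to a grid x grid array of text labels.

    A patch gets `label` when more than half of its pixels are foreground
    (an assumed rule), otherwise `background`.
    Assumes len(mask) >= grid and len(mask[0]) >= grid.
    """
    height, width = len(mask), len(mask[0])
    descriptors = []
    for r in range(grid):
        top, bottom = r * height // grid, (r + 1) * height // grid
        row_labels = []
        for c in range(grid):
            left, right = c * width // grid, (c + 1) * width // grid
            foreground = sum(
                mask[y][x]
                for y in range(top, bottom)
                for x in range(left, right)
            )
            area = (bottom - top) * (right - left)
            row_labels.append(label if foreground * 2 > area else background)
        descriptors.append(row_labels)
    return descriptors


# A 4 x 4 mask whose top-left quadrant is foreground, reduced to 2 x 2 patches.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
descriptors = mask_to_descriptors(mask, "dog", grid=2)
```

With the default `grid=16`, the result holds 256 labels that can be flattened into one text sequence for an autoregressive model to generate.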

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The authors have conducted extensive experiments and ablation studies to demonstrate the effectiveness of their proposed method.

Weaknesses

The paper presents a method that integrates image segmentation into MLLMs by introducing semantic descriptors and utilizing a SAM mask refiner. While the approach simplifies the segmentation process by treating it as a text generation task, the technical contributions appear to be more incremental and engineering-oriented. The method essentially adapts existing MLLMs with semantic descriptors to perform segmentation tasks, serving as a baseline framework that can be applied to different MLLM mod

Reviewer 02 · Rating 5 · Confidence 2

Strengths

- models pixel labels as semantic text tokens with a generalized VLLM model
- compresses token length with row-wise run-length encoding
- comprehensive results on referring expression segmentation and comprehension

Weaknesses

- idea is simple and straightforward
- relies on SAM to get the final pixel-level prediction from the patch-level semantic text token prediction
- should compare with other VLLM approaches that also use SAM refinement

Reviewer 03 · Rating 8 · Confidence 4

Strengths

Originality - The authors propose a new formulation of mask prediction by MLLMs. Unlike previous work that represents masks as a special <SEG> token or as coordinates, this work formulates the mask as a sequence of text labels (called *semantic descriptors*) consisting of 'others' and the queried target label. This formulation allows the MLLM to be trained for segmentation with the LLM's original autoregressive training objective, which eases optimization and keeps the architecture unchanged. Alth

Weaknesses

- Although the authors stress throughout the paper that a main advantage of their mask formulation is that it requires no segmentation decoder, Text4Seg uses SAM to reach performance comparable to other generalist segmentation models. More precisely, Text4Seg *does* require a segmentation decoder but *does not* require finetuning it. Thus, instead of describing Text4Seg as a decoder-free framework, the authors should explain that it is a decoder-training-free framework.
- Since the authors say that SAM is

Code & Models

Repositories

mc-lan/text4seg
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing