Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

Jiazhen Liu; Mingkuan Feng; Long Chen

arXiv:2512.00395·cs.CV·December 2, 2025


TL;DR

This paper introduces STAMP, an MLLM built on a novel all-mask prediction paradigm that improves segmentation performance, preserves dialogue ability, and keeps inference fast by decoupling autoregressive dialogue generation from non-autoregressive mask prediction.

Contribution

The paper proposes a new all-mask prediction paradigm and implements it in STAMP, effectively resolving the segmentation, dialogue, and speed trilemma in MLLMs.

Findings

1. STAMP outperforms state-of-the-art methods on multiple benchmarks.
2. It maintains dialogue ability while achieving high segmentation accuracy.
3. It enables rapid inference through parallel mask prediction.

Abstract

Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After…
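The speed side of the trilemma can be illustrated with a minimal sketch that only counts sequential forward passes. This is not STAMP's implementation (the abstract above is truncated before it describes the method). The function names, the `DecodeCost` type, and the token counts are hypothetical, and the model assumes one forward pass per autoregressive token and a single parallel pass for all mask tokens.

```python
from dataclasses import dataclass


@dataclass
class DecodeCost:
    """Sequential forward passes needed to produce one response (toy model)."""
    text_passes: int  # passes spent on dialogue tokens
    mask_passes: int  # passes spent on mask tokens

    @property
    def total(self) -> int:
        return self.text_passes + self.mask_passes


def next_token_cost(num_text_tokens: int, num_mask_tokens: int) -> DecodeCost:
    # Next-token prediction: every token, text or mask, is generated
    # one at a time, so a richer mask directly means more passes.
    return DecodeCost(num_text_tokens, num_mask_tokens)


def all_mask_cost(num_text_tokens: int, num_mask_tokens: int) -> DecodeCost:
    # Decoupled decoding (assumption): text stays autoregressive, while
    # all mask tokens are predicted together in one non-autoregressive pass.
    return DecodeCost(num_text_tokens, 1 if num_mask_tokens > 0 else 0)


if __name__ == "__main__":
    # Hypothetical sizes: a 20-token reply and a 16 x 16 grid of mask tokens.
    text_tokens, mask_tokens = 20, 16 * 16
    print("next-token passes:", next_token_cost(text_tokens, mask_tokens).total)
    print("all-mask passes:  ", all_mask_cost(text_tokens, mask_tokens).total)
```

Under these assumptions the autoregressive count grows linearly with mask resolution, while the decoupled count does not depend on it. The sketch ignores the cost of each individual pass, so it shows the scaling argument only, not measured latency.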

Code & Models

Models

JiaZL/STAMP-2B-uni (Hugging Face model, 13 downloads, 1 like)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications