Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Jingchao Wang; Zhijian Wu; Dingjiang Huang; Yefeng Zheng; Hong Wang

arXiv:2508.04107 · cs.CV · August 20, 2025

TL;DR

This paper introduces MLLMSeg, a lightweight framework for referring expression segmentation. It exploits the visual detail features already encoded in the MLLM's vision encoder and pairs them with a compact mask decoder, achieving high accuracy at low computational cost.

Contribution

The paper proposes MLLMSeg, a novel, cost-effective framework that exploits visual detail features in MLLMs and introduces a lightweight mask decoder for improved segmentation performance.

Findings

1. Outperforms both SAM-based and SAM-free methods in accuracy.
2. Uses only 34M parameters in the mask decoder.
3. Achieves a better performance-cost trade-off.

Abstract

Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address this trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the…
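To put the two parameter counts quoted on this page side by side, here is a minimal sketch of the arithmetic. It uses only the 34M figure for the mask decoder and the 632M figure for SAM; the function name `param_ratio` and the constants are hypothetical names chosen for illustration, and the comparison covers the segmentation head alone, not the MLLM backbone that both kinds of pipeline carry.

```python
def param_ratio(decoder_params: float, baseline_params: float) -> float:
    """Return the decoder size as a fraction of the baseline size."""
    if baseline_params <= 0:
        raise ValueError("baseline_params must be positive")
    return decoder_params / baseline_params


# Figures quoted on this page (illustrative constant names, not from the paper).
MASK_DECODER_PARAMS = 34e6  # lightweight mask decoder
SAM_PARAMS = 632e6          # Segment Anything Model

ratio = param_ratio(MASK_DECODER_PARAMS, SAM_PARAMS)
print(f"decoder / SAM = {ratio:.1%}")       # decoder / SAM = 5.4%
print(f"SAM / decoder = {1 / ratio:.1f}x")  # SAM / decoder = 18.6x
```

On these numbers the mask decoder is roughly one eighteenth the size of SAM, which is the cost side of the trade-off the abstract describes.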
