TL;DR
This paper introduces MLLMSeg, a lightweight framework for referring expression segmentation that reuses the visual detail features already encoded in the MLLM's vision encoder and pairs them with a compact mask decoder, achieving high accuracy at low computational cost.
Contribution
The paper proposes MLLMSeg, a novel, cost-effective framework that exploits visual detail features in MLLMs and introduces a lightweight mask decoder for improved segmentation performance.
Findings
Outperforms SAM-based and SAM-free methods in accuracy.
Uses only 34M parameters in the mask decoder.
Achieves a better trade-off between accuracy and computational cost than existing SAM-based and SAM-free approaches.
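To put the parameter figures in perspective, the short sketch below compares the 34M-parameter mask decoder with the 632M parameters that the abstract attributes to SAM. Both numbers come from this page; the function name and the framing as a simple ratio are illustrative assumptions. The comparison covers only these two segmentation-specific components and says nothing about the MLLM backbone or about runtime cost.

```python
# Parameter counts quoted on this page, in millions.
SAM_PARAMS_M = 632       # SAM, as stated in the abstract
MLLMSEG_DECODER_M = 34   # MLLMSeg mask decoder, as stated in Findings


def decoder_size_ratio(decoder_m: float, baseline_m: float) -> float:
    """Return the decoder size as a fraction of the baseline size.

    Hypothetical helper for illustration; not part of MLLMSeg.
    """
    if baseline_m <= 0:
        raise ValueError("baseline_m must be positive")
    return decoder_m / baseline_m


ratio = decoder_size_ratio(MLLMSEG_DECODER_M, SAM_PARAMS_M)
reduction = SAM_PARAMS_M / MLLMSEG_DECODER_M

# prints: decoder is 5.4% of SAM's size (18.6x fewer parameters)
print(f"decoder is {ratio:.1%} of SAM's size ({reduction:.1f}x fewer parameters)")
```

Under these figures, the mask decoder is roughly one eighteenth the size of SAM.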
Abstract
Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address this trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the…
