LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Huadong Tang; Youpeng Zhao; Yan Huang; Min Xu; Jun Wang; Qiang Wu

arXiv:2412.00364·cs.CV·February 19, 2026

TL;DR

LMSeg enhances open-vocabulary semantic segmentation by using large language models to generate detailed prompts and combining multiple vision models for improved pixel-level feature alignment, achieving state-of-the-art results.

Contribution

The paper introduces an approach that uses large language models to generate attribute-rich text prompts and combines multiple vision models to strengthen pixel-level features for open-vocabulary segmentation.

Findings

1. Achieves state-of-the-art performance on major benchmarks.
2. Effectively incorporates detailed object attributes into prompts.
3. Enhances visual feature extraction with a learnable fusion strategy.
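The idea of fusing features from several vision models and matching them against text embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how the fusion is parameterized, so a softmax-weighted sum is assumed here, and the names `fuse_features`, `segment`, and `fusion_logits` are hypothetical. The feature maps are assumed to be already projected to a shared dimension, and only the forward pass is shown, with the fusion weights standing in for parameters that training would learn.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_features(feature_maps, fusion_logits):
    """Softmax-weighted sum of per-pixel feature maps (assumed fusion form).

    feature_maps: list of arrays of shape (H, W, D), one per vision model,
        assumed to share the same spatial size and feature dimension D.
    fusion_logits: array of shape (num_models,), hypothetical parameters
        that would be learned during training.
    """
    weights = softmax(np.asarray(fusion_logits, dtype=float))  # (num_models,)
    stacked = np.stack(feature_maps, axis=0)                   # (num_models, H, W, D)
    return np.tensordot(weights, stacked, axes=1)              # (H, W, D)


def segment(pixel_features, text_embeddings):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_features: array of shape (H, W, D).
    text_embeddings: array of shape (C, D), one embedding per class prompt.
    Returns an integer label map of shape (H, W).
    """
    pix = l2_normalize(pixel_features)
    txt = l2_normalize(text_embeddings)
    similarity = pix @ txt.T                                   # (H, W, C)
    return similarity.argmax(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(4, 4, 8))   # stand-in for one model's features
    feats_b = rng.normal(size=(4, 4, 8))   # stand-in for another model's features
    text = rng.normal(size=(3, 8))         # stand-in for three class prompts
    fused = fuse_features([feats_a, feats_b], fusion_logits=[0.0, 0.0])
    print(segment(fused, text).shape)
```

In this sketch the text embeddings would come from encoding the class prompts, which is where richer, attribute-level descriptions would enter; the per-pixel argmax over cosine similarity is the usual open-vocabulary readout and is assumed rather than taken from the paper.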

Abstract

It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Segment Anything Model · ALIGN · Contrastive Language-Image Pre-training