LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu

TL;DR
LMSeg enhances open-vocabulary semantic segmentation by using large language models to generate detailed prompts and combining multiple vision models for improved pixel-level feature alignment, achieving state-of-the-art results.
Contribution
It introduces an approach that uses large language models to generate attribute-rich text prompts and combines multiple vision models to strengthen pixel-level features for open-vocabulary segmentation.
Findings
Achieves state-of-the-art performance on major benchmarks.
Effectively incorporates detailed object attributes into prompts.
Enhances visual feature extraction with a learnable fusion strategy.
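The summary does not say how the learnable fusion is built, so the following is only a minimal sketch of one common form: a softmax-weighted sum of per-pixel features from several vision models. The function names (`softmax`, `fuse_features`), the per-model logits, and the assumption that all models emit features of the same shape are illustrative and not taken from the paper. In a real system the logits would be trained parameters.

```python
import math


def softmax(logits):
    """Turn unconstrained fusion logits into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(feature_sets, fusion_logits):
    """Weighted sum of per-pixel features from several models.

    feature_sets: one entry per vision model, each shaped [num_pixels][dim].
    fusion_logits: one scalar per model (hypothetical learnable parameters).
    Assumes every model produces features with the same shape.
    """
    weights = softmax(fusion_logits)
    num_pixels = len(feature_sets[0])
    dim = len(feature_sets[0][0])
    fused = [[0.0] * dim for _ in range(num_pixels)]
    for w, feats in zip(weights, feature_sets):
        for p in range(num_pixels):
            for d in range(dim):
                fused[p][d] += w * feats[p][d]
    return fused


# Toy usage: two models, one pixel, 2-D features, equal logits.
model_a = [[1.0, 0.0]]
model_b = [[0.0, 1.0]]
fused = fuse_features([model_a, model_b], fusion_logits=[0.0, 0.0])
```

With equal logits the fused feature is the plain average of the two inputs; raising one logit shifts the result toward that model's features.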
Abstract
It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched…
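To make the alignment idea in the abstract concrete, here is a minimal sketch of how CLIP-style open-vocabulary segmentation typically assigns a label to each pixel: compare the pixel's feature vector with one text embedding per class name and pick the most similar. The function names, the toy 2-D vectors, and the class names are assumptions for illustration only; this is not the LMSeg pipeline, and real embeddings would come from pretrained text and vision encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def label_pixels(pixel_features, text_embeddings):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_features: list of per-pixel feature vectors.
    text_embeddings: dict mapping class name -> text embedding vector.
    """
    names = list(text_embeddings)
    labels = []
    for feat in pixel_features:
        scores = [cosine(feat, text_embeddings[n]) for n in names]
        labels.append(names[scores.index(max(scores))])
    return labels


# Toy usage with made-up 2-D embeddings.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
pixel_features = [[0.9, 0.1], [0.2, 0.8]]
labels = label_pixels(pixel_features, text_embeddings)
```

Because the class list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is the property that makes the approach open-vocabulary. Richer prompts change only the text embeddings, and better pixel-level features change only the visual side.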
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Segment Anything Model · ALIGN · Contrastive Language-Image Pre-training
