Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Yuheng Shi; Minjing Dong; Chang Xu

arXiv:2411.09219·cs.CV·November 15, 2024

TL;DR

This paper introduces Trident, a training-free framework that combines CLIP, DINO, and SAM to improve open-vocabulary semantic segmentation by addressing resolution limitations and refining coarse outputs, achieving state-of-the-art results.

Contribution

The paper proposes a novel splice-then-segment paradigm using SAM to enhance resolution handling in open-vocabulary segmentation without additional training.

Findings

01 Raises mean IoU (mIoU) from 44.4 to 48.6 across eight benchmarks.

02 Integrates CLIP, DINO, and SAM for high-resolution segmentation.

03 A refinement strategy improves the coarse segmentation outputs.

Abstract

While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall stems primarily from its spatially invariant semantic features and constrained resolution. Previous adaptations addressed the spatial invariance by modifying the self-attention in CLIP's image encoder, but the issue of limited resolution remains unexplored. Unlike previous segment-then-splice methods, which segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from…
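The "splice" half of the splice-then-segment idea can be illustrated with a toy example. The sketch below is not the authors' implementation: `toy_encoder` and `splice_features` are hypothetical names, the "encoder" simply averages pixel intensities per patch instead of running CLIP or DINO, and overlapping windows are merged by plain averaging, which is an assumption made here for simplicity. It only shows how per-window feature grids from a sliding window could be placed back into one image-level feature map before any segmentation step is applied.

```python
from typing import Callable, List

Grid = List[List[float]]


def toy_encoder(window: Grid, patch: int) -> Grid:
    """Hypothetical stand-in for a patch encoder: mean intensity per patch."""
    h, w = len(window), len(window[0])
    out: Grid = []
    for i in range(0, h, patch):
        row = []
        for j in range(0, w, patch):
            vals = [window[y][x]
                    for y in range(i, i + patch)
                    for x in range(j, j + patch)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def splice_features(image: Grid, window: int, stride: int, patch: int,
                    encoder: Callable[[Grid, int], Grid] = toy_encoder) -> Grid:
    """Encode sliding windows and splice their features into one global grid.

    Assumes (for this sketch only) that image size, window, and stride are
    multiples of `patch`, and that the windows tile the image exactly.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    assert window % patch == 0 and stride % patch == 0
    assert (h - window) % stride == 0 and (w - window) % stride == 0

    gh, gw = h // patch, w // patch
    total = [[0.0] * gw for _ in range(gh)]   # running sum of features
    count = [[0] * gw for _ in range(gh)]     # number of windows per cell

    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crop = [row[left:left + window] for row in image[top:top + window]]
            feats = encoder(crop, patch)
            for i, feat_row in enumerate(feats):
                for j, f in enumerate(feat_row):
                    gi, gj = top // patch + i, left // patch + j
                    total[gi][gj] += f
                    count[gi][gj] += 1

    # Average the features wherever windows overlap.
    return [[total[i][j] / count[i][j] for j in range(gw)] for i in range(gh)]


if __name__ == "__main__":
    image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
    spliced = splice_features(image, window=4, stride=2, patch=2)
    print(len(spliced), len(spliced[0]))  # size of the global feature grid
```

Because the toy encoder ignores context, every window yields the same value for a given patch and the averaging has no visible effect. With a real image encoder, features for the same patch would differ between windows, so the merge rule would matter.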

Code & Models

Repositories

YuHengsss/Trident (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Vision Transformer · Segment Anything Model · self-DIstillation with NO labels