$S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models

Xiaojie Yin; Qilong Wang; Bing Cao; Qinghua Hu

arXiv:2412.04925·cs.CV·December 9, 2024

TL;DR

This paper introduces $S^3$, a synonymous semantic space for each image class that leverages multiple textual concepts generated by large language models to improve zero-shot generalization of vision-language models like CLIP.

Contribution

The paper builds a synonymous semantic space for each image class from LLM-generated synonymous concepts, using a Vietoris-Rips complex over those concepts. This improves zero-shot performance by accounting for lexical variation in how a class can be described.
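To make the term concrete, here is a minimal sketch of a Vietoris-Rips complex over a handful of concept embeddings: a set of points forms a simplex whenever all of its members are pairwise within a threshold. The toy 2-D vectors, the cosine distance, the `epsilon` value, and the function names are illustrative assumptions; the page does not state which distance or threshold the paper uses.

```python
from itertools import combinations
import math


def cosine_distance(u, v):
    """1 - cosine similarity (an assumed choice of distance for this sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def vietoris_rips(points, epsilon, max_dim=2):
    """Return simplices as tuples of point indices.

    A tuple is included when every pair of its vertices lies within
    `epsilon` of each other. `max_dim=2` stops at triangles.
    """
    n = len(points)
    close = {
        (i, j): cosine_distance(points[i], points[j]) <= epsilon
        for i, j in combinations(range(n), 2)
    }
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for size in range(2, max_dim + 2):
        for combo in combinations(range(n), size):
            if all(close[pair] for pair in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


# Hypothetical embeddings: three near-synonyms and one unrelated concept.
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0)]
complex_ = vietoris_rips(toy_embeddings, epsilon=0.05)
print(complex_)
```

In this toy case the three near-synonyms are joined by edges and a triangle, while the unrelated point stays isolated, which is the sense in which the complex groups mutually close concepts.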

Findings

01. $S^3$ outperforms state-of-the-art methods on 17 benchmarks.

02. Using multiple synonymous concepts per class, rather than a single one, stabilizes semantic alignment.

03. The point-to-local-center metric improves zero-shot predictions.
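This page does not define the point-to-local-center metric, so the following is only one plausible reading, not the paper's formulation: score a class by the cosine similarity between the image embedding and the mean of its `k` nearest synonym embeddings for that class. The function names, the value of `k`, and the toy vectors are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def local_center_score(image_emb, synonym_embs, k=2):
    """Similarity between the image and an assumed 'local center':
    the mean of the k synonym embeddings most similar to the image."""
    ranked = sorted(synonym_embs, key=lambda s: cosine(image_emb, s), reverse=True)
    nearest = [normalize(s) for s in ranked[:k]]
    center = [sum(col) / len(nearest) for col in zip(*nearest)]
    return cosine(image_emb, center)


def predict(image_emb, class_to_synonyms, k=2):
    """Pick the class whose local center is most similar to the image."""
    return max(
        class_to_synonyms,
        key=lambda c: local_center_score(image_emb, class_to_synonyms[c], k),
    )


# Hypothetical 2-D embeddings standing in for CLIP text and image features.
toy_classes = {
    "cat": [[1.0, 0.0], [0.9, 0.2], [-1.0, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9], [0.2, -1.0]],
}
print(predict([0.95, 0.1], toy_classes))
print(predict([0.05, 1.0], toy_classes))
```

Restricting the average to the nearest synonyms keeps an outlying synonym (the third vector in each toy class) from pulling the class representative away from the image, which is one motivation for a local rather than a global center.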

Abstract

Recently, many studies have been conducted to enhance the zero-shot generalization ability of vision-language models (e.g., CLIP) by addressing the semantic misalignment between image and text embeddings in downstream tasks. Although many efforts have been made, existing methods barely consider the fact that a class of images can be described by notably different textual concepts due to well-known lexical variation in natural language processing, which heavily affects the zero-shot generalization of CLIP. Therefore, this paper proposes a Synonymous Semantic Space ($S^3$) for each image class, rather than relying on a single textual concept, achieving more stable semantic alignment and improving the zero-shot generalization of CLIP. Specifically, our $S^3$ method first generates several synonymous concepts based on the label of each class by using large…

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training