$S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models
Xiaojie Yin, Qilong Wang, Bing Cao, Qinghua Hu

TL;DR
This paper introduces $S^3$, a synonymous semantic space for each image class that leverages multiple textual concepts generated by large language models to improve zero-shot generalization of vision-language models like CLIP.
Contribution
The paper proposes a novel semantic space construction that combines synonymous concepts with a Vietoris-Rips complex, improving zero-shot performance by accounting for lexical variation.
Findings
$S^3$ outperforms state-of-the-art methods on 17 benchmarks.
Using multiple synonyms stabilizes semantic alignment.
A point-to-local-center metric improves zero-shot predictions.
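To make the last two findings concrete, here is a minimal sketch of how an image embedding might be scored against several local centers per class instead of a single text embedding. It is an illustration under stated assumptions, not the paper's implementation: the function names (`local_centers`, `class_score`, `predict`), the threshold `eps`, and the choice to group synonyms by connected components of the Vietoris-Rips 1-skeleton are all hypothetical, and the toy 2-D vectors stand in for real CLIP embeddings.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def local_centers(synonym_emb, eps):
    """Group synonym embeddings and return one unit-norm center per group.

    Assumption: two synonyms are linked when their cosine distance is <= eps,
    i.e. they share an edge in a Vietoris-Rips complex at scale eps. Groups
    are the connected components of that edge graph.
    """
    emb = l2_normalize(synonym_emb)
    n = len(emb)
    adj = (1.0 - emb @ emb.T) <= eps          # pairwise cosine distance test
    labels = -np.ones(n, dtype=int)
    group = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = group
        stack = [start]
        while stack:                           # depth-first flood fill
            i = stack.pop()
            for j in np.flatnonzero(adj[i]):
                if labels[j] < 0:
                    labels[j] = group
                    stack.append(j)
        group += 1
    centers = np.stack([emb[labels == g].mean(axis=0) for g in range(group)])
    return l2_normalize(centers)

def class_score(image_emb, synonym_emb, eps=0.1):
    """Cosine similarity between the image and its nearest local center."""
    centers = local_centers(synonym_emb, eps)
    return float(np.max(centers @ l2_normalize(image_emb)))

def predict(image_emb, class_synonyms, eps=0.1):
    """Return the index of the best-scoring class and all class scores."""
    scores = [class_score(image_emb, s, eps) for s in class_synonyms]
    return int(np.argmax(scores)), scores

# Toy example: class 0 has two distinct groups of synonyms, class 1 has one.
class_0 = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
class_1 = [[-1.0, 0.0], [-0.9, -0.1]]
pred, scores = predict([0.1, 1.0], [class_0, class_1])
print(pred, scores)
```

In this toy setup the image lies close to only one of class 0's synonym groups, so scoring against the nearest local center keeps the match strong, whereas averaging all of class 0's synonyms into one vector would dilute it.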
Abstract
Recently, many studies have been conducted to enhance the zero-shot generalization ability of vision-language models (e.g., CLIP) by addressing the semantic misalignment between image and text embeddings in downstream tasks. Although many efforts have been made, existing methods barely consider the fact that a class of images can be described by notably different textual concepts due to well-known lexical variation in natural language processing, which heavily affects the zero-shot generalization of CLIP. Therefore, this paper proposes a Synonymous Semantic Space ($S^3$) for each image class, rather than relying on a single textual concept, achieving more stable semantic alignment and improving the zero-shot generalization of CLIP. Specifically, our method first generates several synonymous concepts based on the label of each class by using large…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
