Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon

TL;DR
This paper introduces SToRI, a framework that reweights semantic tokens in CLIP's text embeddings to improve interpretability and controllability, demonstrated through experiments on image classification and retrieval.
Contribution
The paper presents SToRI, a novel method for differential semantic token weighting in CLIP, enhancing interpretability and user-controlled emphasis in text embeddings.
Findings
Improved interpretability of CLIP text embeddings.
Enhanced controllability over semantic emphasis.
Better performance in few-shot image classification and retrieval.
Abstract
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on…
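The idea of weighting tokens differentially can be illustrated with a small sketch. The excerpt above does not spell out the exact mechanism, so the following is an assumption made for illustration: each token j gets a positive weight w_j, and its unnormalized attention score exp(logit_j) is multiplied by w_j before normalization. The function names `reweighted_softmax` and `attend`, the example prompt, and all numbers are hypothetical; the sketch also ignores multi-head attention and the causal mask used in CLIP's text encoder.

```python
import math


def reweighted_softmax(logits, weights):
    """Softmax in which token j's score exp(logit_j) is scaled by weights[j].

    Assumption for illustration: weights are strictly positive, and a weight
    of 1.0 for every token recovers the ordinary softmax.
    """
    if len(logits) != len(weights):
        raise ValueError("logits and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    # Multiplying exp(logit) by w is the same as adding log(w) to the logit.
    shifted = [l + math.log(w) for l, w in zip(logits, weights)]
    m = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def attend(logits, values, weights):
    """Weighted average of value vectors under the reweighted attention."""
    probs = reweighted_softmax(logits, weights)
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]


# Hypothetical prompt and attention logits for a single query position.
tokens = ["a", "photo", "of", "a", "red", "car"]
logits = [0.1, 0.8, 0.0, 0.1, 1.2, 1.5]

uniform = reweighted_softmax(logits, [1.0] * len(tokens))
# Emphasize "red" by giving it three times the weight of the other tokens.
emphasized = reweighted_softmax(logits, [1.0, 1.0, 1.0, 1.0, 3.0, 1.0])

for tok, p_uniform, p_emph in zip(tokens, uniform, emphasized):
    print(f"{tok:>6}  uniform={p_uniform:.3f}  emphasized={p_emph:.3f}")
```

In this sketch, raising one token's weight increases its share of attention and lowers every other token's share proportionally, which is one way a user-chosen or learned weight could shift emphasis in the resulting embedding.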
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Language-Image Pre-training
