Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations
Shamik Bhattacharya, Daniel Perkins, Yaren Dogan, Vineeth Konjeti, Sudarshan Srinivasan, and Edmon Begoli

TL;DR
This paper introduces a visual word sense disambiguation framework using CLIP, dual-channel prompts, and image augmentations to improve lexical ambiguity resolution in multimodal contexts.
Contribution
It proposes an interpretable VWSD method leveraging CLIP with dual-channel prompts and robust augmentations, demonstrating improved performance on the SemEval-2023 dataset.
Findings
Enriching the text and image embeddings raises MRR from 0.7227 to 0.7590 and Hit Rate from 0.5810 to 0.6220 on the SemEval-2023 VWSD dataset.
Dual-channel prompting offers strong, low-latency performance.
Aggressive image augmentation yields marginal gains.
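For readers unfamiliar with the two reported metrics, the sketch below shows how MRR and Hit Rate are commonly computed from ranked candidate lists. It is an illustration only, not the authors' evaluation code: the function names and the toy rankings are hypothetical, and Hit Rate is assumed here to mean the fraction of instances whose top-ranked image is the gold image.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """Return 1/rank of the gold image (ranks start at 1), or 0.0 if it is absent."""
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id == gold_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, gold_ids):
    """Average the reciprocal rank over all evaluation instances."""
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold_ids))
    return total / len(gold_ids)


def hit_rate_at_1(rankings, gold_ids):
    """Fraction of instances whose top-ranked image is the gold image (assumed definition)."""
    hits = sum(1 for r, g in zip(rankings, gold_ids) if r and r[0] == g)
    return hits / len(gold_ids)


# Toy data (hypothetical): three instances, gold image "a" ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold_ids = ["a", "a", "a"]

mrr = mean_reciprocal_rank(rankings, gold_ids)  # (1 + 1/2 + 1/3) / 3
hit = hit_rate_at_1(rankings, gold_ids)         # 1 / 3
print(f"MRR={mrr:.4f} HitRate={hit:.4f}")
```

MRR rewards placing the gold image near the top even when it is not first, which is why it sits above Hit Rate in the reported numbers.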
Abstract
Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency…
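The ranking step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the embeddings are toy 3-dimensional vectors standing in for CLIP outputs, the two prompt channels are fused by averaging their normalized embeddings, and augmented views of each image are pooled the same way. The abstract does not specify the fusion or pooling rule, so both choices, along with every function name, are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; return a copy unchanged if its norm is zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_embedding(vectors):
    """Normalize each vector, average them, and renormalize (assumed pooling rule)."""
    unit = [l2_normalize(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def rank_candidates(semantic_embs, photo_embs, candidate_views):
    """Rank candidate images by cosine similarity to the fused text embedding.

    semantic_embs / photo_embs: embeddings of the prompts in each channel.
    candidate_views: maps image id to the embeddings of its augmented views.
    """
    # Assumption: each channel is pooled, then the two channels are averaged.
    text = mean_embedding([mean_embedding(semantic_embs), mean_embedding(photo_embs)])
    scores = {
        image_id: cosine(text, mean_embedding(views))
        for image_id, views in candidate_views.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy stand-ins for CLIP embeddings (hypothetical values).
semantic_embs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]  # e.g. synonym-based prompts
photo_embs = [[0.8, 0.0, 0.1]]                      # e.g. a photo-style prompt
candidate_views = {
    "img_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "img_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "img_c": [[0.0, 0.0, 1.0]],
}

ranking, scores = rank_candidates(semantic_embs, photo_embs, candidate_views)
print(ranking)
```

In a real pipeline the toy vectors would be replaced by the outputs of a CLIP text encoder and image encoder; the ranking logic itself would stay the same.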
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
