Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation
Shashini Nilukshi, Deshan Sumanathilaka

TL;DR
This mini review discusses recent advances in Visual Word Sense Disambiguation, highlighting how multimodal models like CLIP and LLMs improve lexical ambiguity resolution in vision-language tasks, with notable performance gains and ongoing challenges.
Contribution
It provides a comprehensive overview of VWSD development from 2016 to 2025, emphasizing new frameworks, techniques, and future directions in multimodal disambiguation.
Findings
CLIP-based models outperform zero-shot baselines by 6-8% in Mean Reciprocal Rank (MRR).
Multimodal approaches improve disambiguation accuracy.
Challenges include limited multilingual datasets and evaluation frameworks.
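Since the headline finding is reported in Mean Reciprocal Rank, a short sketch of how that metric is typically computed may help. MRR averages 1/rank of the correct item over all test instances, so a system that always ranks the gold image first scores 1.0. The function names and the toy rankings below are illustrative assumptions, not taken from the paper.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], gold: Hashable) -> float:
    """Return 1/rank of the gold item, or 0.0 if it is not in the ranking."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], golds: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all instances."""
    if len(rankings) != len(golds):
        raise ValueError("rankings and golds must have the same length")
    if not rankings:
        raise ValueError("at least one instance is required")
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds))
    return total / len(rankings)


# Hypothetical example: three ambiguous words, each with ranked candidate images.
toy_rankings = [
    ["img_a", "img_b", "img_c", "img_d"],  # gold at rank 1
    ["img_b", "img_a", "img_c", "img_d"],  # gold at rank 2
    ["img_b", "img_c", "img_d", "img_a"],  # gold at rank 4
]
toy_golds = ["img_a", "img_a", "img_a"]

mrr = mean_reciprocal_rank(toy_rankings, toy_golds)
print(f"MRR = {mrr:.3f}")  # (1 + 1/2 + 1/4) / 3
```

Because the metric rewards the position of the single correct image, it suits the VWSD setting where exactly one candidate matches the intended sense.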
Abstract
This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), a multimodal extension of traditional Word Sense Disambiguation (WSD) that tackles lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review traces developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. The review focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems…
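The contrastive approach mentioned in the abstract can be illustrated with a minimal sketch: the ambiguous word and its short context are embedded as one vector, each candidate image as another, and candidates are ranked by cosine similarity. The three-dimensional vectors and image names below are hand-made stand-ins for real encoder outputs (for example, from a CLIP-style model); they are assumptions for illustration, not values from any study in the review.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    text_embedding: Sequence[float],
    image_embeddings: Dict[str, Sequence[float]],
) -> List[str]:
    """Return image names sorted from most to least similar to the text."""
    scores = {
        name: cosine_similarity(text_embedding, vector)
        for name, vector in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embedding for the phrase "bank erosion" (ambiguous word: "bank").
toy_text = [0.9, 0.1, 0.0]

# Hypothetical embeddings for three candidate images.
toy_images = {
    "river_bank.jpg": [0.8, 0.2, 0.1],
    "financial_bank.jpg": [0.1, 0.9, 0.2],
    "park_bench.jpg": [0.0, 0.1, 0.9],
}

ranking = rank_candidates(toy_text, toy_images)
print(ranking)
```

In a real system the embeddings would come from jointly trained text and image encoders, and the resulting ranking is what a metric such as MRR is computed over.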
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Language, Metaphor, and Cognition
