Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou

TL;DR
This paper explores the use of multimodal transformers and Large Language Models to improve visual word sense disambiguation by combining various retrieval and reasoning approaches, leading to competitive results.
Contribution
It introduces a comprehensive approach combining multimodal retrieval, LLM-based knowledge enhancement, and learn-to-rank models for VWSD, advancing the state of the art.
Findings
Multimodal transformer methods improve retrieval accuracy.
LLMs with Chain-of-Thought enhance explainability.
Combining modules via learn-to-rank yields competitive performance.
Abstract
Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an image among a set of candidates, which better represents the meaning of an ambiguous word within a given context. In this paper, we make a substantial step towards unveiling this interesting task by applying a varying set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting to text-to-text and image-to-image retrieval, as well as question-answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
MethodsSparse Evolutionary Training
