HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance
Zhuohao Yin, Xin Huang

TL;DR
This paper introduces a multi-modal retrieval framework for Visual Word Sense Disambiguation that leverages pretrained models, knowledge bases, and datasets to improve sense selection by integrating textual and visual information.
Contribution
The paper presents a novel multi-modal retrieval system combining gloss matching, prompting, image retrieval, and modality fusion for VWSD, utilizing pretrained vision-language models and open datasets.
Findings
System beats nearly half of the participating teams.
Provides insights into multi-modal learning and WSD.
Highlights potential improvements for VWSD tasks.
Abstract
Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to select, among a batch of candidate images, the one that best entails the target word's meaning within a limited context. In this paper, we propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models, as well as open knowledge bases and datasets. Our system consists of the following key components: (1) Gloss matching: a pretrained bi-encoder model is used to match contexts with proper senses of the target words; (2) Prompting: matched glosses and other textual information, such as synonyms, are incorporated using a prompting template; (3) Image retrieval: semantically matching images are retrieved from large open datasets using prompts as queries; (4) Modality fusion: contextual information from different modalities are fused and used for prediction. Although our system…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
