HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance

Zhuohao Yin; Xin Huang

arXiv:2311.18273 · cs.CV · December 1, 2023 · 1 citation

TL;DR

This paper introduces a multi-modal retrieval framework for Visual Word Sense Disambiguation that leverages pretrained models, knowledge bases, and datasets to improve sense selection by integrating textual and visual information.

Contribution

The paper presents a novel multi-modal retrieval system combining gloss matching, prompting, image retrieval, and modality fusion for VWSD, utilizing pretrained vision-language models and open datasets.

Findings

01. System beats nearly half of the participating teams.

02. Provides insights into multi-modal learning and WSD.

03. Highlights potential improvements for VWSD tasks.

Abstract

Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to select, among a batch of candidate images, the one that best entails the target word's meaning within a limited context. In this paper, we propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models, as well as open knowledge bases and datasets. Our system consists of the following key components: (1) Gloss matching: a pretrained bi-encoder model is used to match contexts with proper senses of the target words; (2) Prompting: matched glosses and other textual information, such as synonyms, are incorporated using a prompting template; (3) Image retrieval: semantically matching images are retrieved from large open datasets using prompts as queries; (4) Modality fusion: contextual information from different modalities is fused and used for prediction. Although our system…
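
The four components listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`match_gloss`, `build_prompt`, `fuse_and_rank`), the prompt template, the weighted-sum fusion rule, and the `alpha` weight are all assumptions made for illustration, and the toy vectors stand in for embeddings that a real system would obtain from pretrained bi-encoder and vision-language models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_gloss(context_vec, gloss_vecs):
    """Step 1 (gloss matching): pick the sense whose gloss embedding is
    closest to the context embedding. `gloss_vecs` maps sense id -> vector."""
    return max(gloss_vecs, key=lambda sense: cosine(context_vec, gloss_vecs[sense]))


def build_prompt(context, gloss, synonyms):
    """Step 2 (prompting): combine context, matched gloss, and synonyms.
    The template here is hypothetical, not the one used in the paper."""
    parts = [context, gloss]
    if synonyms:
        parts.append("synonyms: " + ", ".join(synonyms))
    return "; ".join(parts)


def fuse_and_rank(text_vec, retrieved_image_vecs, candidate_vecs, alpha=0.5):
    """Steps 3-4 (image retrieval + modality fusion), under assumptions:
    `retrieved_image_vecs` are embeddings of images already retrieved with
    the prompt as query; each candidate is scored by an assumed weighted sum
    of its text similarity and its best similarity to a retrieved image.
    Returns candidate indices ordered from best to worst."""
    scores = []
    for cand in candidate_vecs:
        text_score = cosine(text_vec, cand)
        image_score = max(
            (cosine(ref, cand) for ref in retrieved_image_vecs), default=0.0
        )
        scores.append(alpha * text_score + (1 - alpha) * image_score)
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 2-d embeddings; a real system would use model outputs instead.
    glosses = {"bank_river": [0.9, 0.1], "bank_finance": [0.1, 0.9]}
    sense = match_gloss([1.0, 0.0], glosses)
    prompt = build_prompt("bank erosion", "sloping land beside water", ["riverbank"])
    ranking = fuse_and_rank([1.0, 0.0], [[1.0, 0.1]], [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
    print(sense, "|", prompt, "|", ranking)
```

Setting `alpha=1.0` in this sketch ignores the retrieved images and ranks by text similarity alone, which gives a simple way to check whether the visual signal changes the ranking.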

Code & Models

Repositories

thomas-yin/semeval-2023-task1
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques