Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt Augmentation and Text-To-Image Diffusion

Jie S. Li, Yow-Ting Shiue, Yong-Siang Shih, and Jonas Geiping

arXiv:2307.05564·cs.CL·July 13, 2023

TL;DR

This paper presents zero-shot methods using prompt augmentation and diffusion models to improve CLIP's handling of compositionality and ambiguity in Visual Word Sense Disambiguation, achieving better image-text matching.

Contribution

Introduces Augment-CLIP and SD Sampling systems that enhance CLIP's performance in VWSD by addressing compositionality and ambiguity with prompt and image augmentation techniques.

Findings

01 Augment-CLIP improves phrase-image matching accuracy.

02 SD Sampling increases the likelihood of matching relevant images.

03 Multilingual CLIP models help disambiguate words through translation.

Abstract

This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation (VWSD) Task in English. Our preliminary study shows that the simple approach of matching candidate images with the phrase using CLIP suffers from the many-to-many nature of image-text pairs. We find that the CLIP text encoder may have limited abilities in capturing the compositionality in natural language. Conversely, the descriptive focus of the phrase varies from instance to instance. We address these issues in our two systems, Augment-CLIP and Stable Diffusion Sampling (SD Sampling). Augment-CLIP augments the text prompt by generating sentences that contain the context phrase with the help of large language models (LLMs). We further explore CLIP models in other languages, as an ambiguous word may be translated into an unambiguous one in the other language. SD Sampling uses text-to-image Stable…
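The prompt-augmentation idea in the abstract can be illustrated with a minimal sketch. It assumes the context phrase and its LLM-generated sentences have already been encoded into text embeddings, and the candidate images into image embeddings, by some CLIP-style encoder. The function names `cosine` and `rank_candidates`, the toy vectors, and the choice to average similarity over all augmented prompts are illustrative assumptions; the abstract does not state how Augment-CLIP aggregates scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(prompt_embeddings, image_embeddings):
    """Rank candidate images, best first.

    Each image is scored by its mean cosine similarity to every
    augmented prompt embedding. Averaging is an assumption made for
    this sketch, not a detail taken from the paper.
    """
    scores = []
    for image in image_embeddings:
        total = sum(cosine(prompt, image) for prompt in prompt_embeddings)
        scores.append(total / len(prompt_embeddings))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-D vectors standing in for real encoder outputs (hypothetical data).
prompts = [[1.0, 0.0], [0.8, 0.6]]                 # phrase + one augmented sentence
images = [[0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]     # three candidate images
ranking = rank_candidates(prompts, images)
print(ranking)
```

In this toy setup the second candidate (index 1) ranks first because it is close to both prompt vectors, whereas a candidate that matches only one of the prompts is ranked lower.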

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Focus · Diffusion · Contrastive Language-Image Pre-training