SADL: An Effective In-Context Learning Method for Compositional Visual QA

Long Hoang Dang; Thao Minh Le; Vuong Le; Tu Minh Phuong; Truyen Tran

arXiv:2407.01983 · cs.CV · July 3, 2024

TL;DR

SADL is a novel visual-linguistic prompting framework that improves compositional Visual QA by sampling, decomposing, and pseudo-labeling image-question pairs, addressing the semantic gap in vision-language models.

Contribution

This paper introduces SADL, a new prompting method for compositional Visual QA that leverages sampling, question decomposition, and pseudo-labeling to enhance model performance.

Findings

01 Sampling in semantic proximity improves accuracy.

02 Decomposing complex questions aids understanding.

03 Pseudo-labeling enhances training data quality.
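
The first finding, sampling in semantic proximity, can be illustrated with a minimal sketch. It assumes each candidate image-question pair already has an embedding vector and that proximity is measured by cosine similarity; the page does not state how SADL computes proximity, so the function names (`cosine`, `sample_demonstrations`), the toy vectors, and the similarity measure are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sample_demonstrations(query_vec, pool, k=2):
    """Return the k pool items whose embeddings are closest to the query embedding."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]

# Toy candidate pool; the vectors are made up for illustration only.
pool = [
    {"question": "What color is the cup left of the plate?", "vec": [0.9, 0.1, 0.0]},
    {"question": "How many dogs are there?", "vec": [0.0, 1.0, 0.0]},
    {"question": "What color is the mug on the table?", "vec": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # hypothetical embedding of the query image-question pair

demos = sample_demonstrations(query_vec, pool, k=2)
for d in demos:
    print(d["question"])
```

With these toy vectors the two color questions rank above the counting question, so they would be chosen as demonstrations for the query.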

Abstract

Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA. When prompted with a few demonstrations of image-question-answer triplets, LVLMs have demonstrated the ability to discern underlying patterns and transfer this latent knowledge to answer new questions about unseen images without the need for expensive supervised fine-tuning. However, designing effective vision-language prompts, especially for compositional questions, remains poorly understood. Adapting language-only ICL techniques may not necessarily work because we need to bridge the visual-linguistic semantic gap: Symbolic concepts must be grounded in visual content, which does not share the syntactic linguistic structures. This paper introduces SADL, a new visual-linguistic prompting framework for the task. SADL revolves around three key components: SAmpling,…
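
The few-shot setup the abstract describes, prompting with image-question-answer triplets followed by a new question, can be sketched as plain prompt assembly. This is an illustrative assumption, not the paper's prompt format: the `<image_i>` placeholders, the `Question:`/`Answer:` labels, and the `build_prompt` helper are hypothetical, and a real LVLM would require its own image-token convention.

```python
def build_prompt(demos, query_question):
    """Interleave demonstration triplets, then append the unanswered query."""
    lines = []
    for i, demo in enumerate(demos, start=1):
        lines.append(f"<image_{i}>")  # hypothetical placeholder for the i-th demo image
        lines.append(f"Question: {demo['question']}")
        lines.append(f"Answer: {demo['answer']}")
    lines.append("<image_query>")  # hypothetical placeholder for the query image
    lines.append(f"Question: {query_question}")
    lines.append("Answer:")  # left open for the model to complete
    return "\n".join(lines)

# Toy demonstrations; questions and answers are made up for illustration.
demos = [
    {"question": "What color is the cup left of the plate?", "answer": "red"},
    {"question": "What color is the mug on the table?", "answer": "blue"},
]
prompt = build_prompt(demos, "What color is the bowl next to the spoon?")
print(prompt)
```

No model weights are updated in this setup; the demonstrations only condition the model through its input.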

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Visual Attention and Saliency Detection

Methods: ALIGN