SADL: An Effective In-Context Learning Method for Compositional Visual QA
Long Hoang Dang, Thao Minh Le, Vuong Le, Tu Minh Phuong, Truyen Tran

TL;DR
SADL is a novel visual-linguistic prompting framework that improves compositional Visual QA by sampling, decomposing, and pseudo-labeling image-question pairs, with the aim of bridging the semantic gap between visual content and language.
Contribution
This paper introduces SADL, a new prompting method for compositional Visual QA that leverages sampling, question decomposition, and pseudo-labeling to enhance model performance.
Findings
Sampling in semantic proximity improves accuracy.
Decomposing complex questions aids understanding.
Pseudo-labeling enhances training data quality.
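To make the first finding concrete, here is a minimal sketch of picking in-context demonstrations that lie close to the query in an embedding space. It is not the paper's procedure, which this summary does not describe. It assumes that each candidate image-question pair already has a precomputed embedding vector, and it uses plain cosine similarity with top-k selection as a stand-in for "semantic proximity". The names `cosine`, `sample_demonstrations`, and the toy `pool` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_demonstrations(query_emb, pool, k=2):
    """Return the names of the k pool items most similar to the query.

    pool is a list of (name, embedding) pairs. The embeddings are assumed
    to come from some encoder of image-question pairs, which is not shown here.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy 2-D embeddings, for illustration only.
pool = [
    ("demo_a", [1.0, 0.0]),
    ("demo_b", [0.9, 0.1]),
    ("demo_c", [0.0, 1.0]),
    ("demo_d", [-1.0, 0.0]),
]
selected = sample_demonstrations([1.0, 0.05], pool, k=2)
print(selected)
```

In a real pipeline the selected demonstrations would then be formatted as image-question-answer triplets and placed in the prompt ahead of the new question.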
Abstract
Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA. When prompted with a few demonstrations of image-question-answer triplets, LVLMs have demonstrated the ability to discern underlying patterns and transfer this latent knowledge to answer new questions about unseen images without the need for expensive supervised fine-tuning. However, designing effective vision-language prompts, especially for compositional questions, remains poorly understood. Adapting language-only ICL techniques may not necessarily work because we need to bridge the visual-linguistic semantic gap: Symbolic concepts must be grounded in visual content, which does not share the syntactic linguistic structures. This paper introduces SADL, a new visual-linguistic prompting framework for the task. SADL revolves around three key components: SAmpling,…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Visual Attention and Saliency Detection
Methods: ALIGN
