Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

Igor Sterner; Weizhe Lin; Jinghong Chen; Bill Byrne

arXiv:2403.11317 · cs.CL · March 19, 2024 · 1 citation

Open Access

TL;DR

This paper compares two ways of feeding images to a frozen large language model for few-shot visual question answering: captioning the image, and mapping image embeddings into the LLM's embedding space. For Flan-T5 XL, embedding mapping does not guarantee better performance than captioning, and captioning is better in the zero-shot regime.

Contribution

The study provides a controlled comparison between captioning and embedding mapping approaches for few-shot VQA with LLMs, highlighting the impact of in-context example selection.

Findings

01. Captioning outperforms embedding mapping in zero-shot VQA.

02. Embedding mapping does not guarantee better performance over captioning.

03. In few-shot regimes, in-context example selection influences which approach performs better.
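
The last finding concerns how in-context examples are chosen. The sketch below contrasts two generic selection strategies: random sampling, and nearest-neighbour selection by cosine similarity. Both strategies, the function names, and the toy vectors are assumptions made for illustration; this summary does not state which selection methods the paper evaluates.

```python
import math
import random
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_random(pool_size: int, k: int, seed: int = 0) -> List[int]:
    """Pick k distinct example indices uniformly at random (hypothetical baseline)."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), k)


def select_by_similarity(
    query: Sequence[float], pool: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Pick the k pool indices whose vectors are most similar to the query."""
    ranked = sorted(
        range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True
    )
    return ranked[:k]


# Toy usage: three candidate examples represented by 2-d vectors.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
nearest = select_by_similarity(query, pool, k=2)
sampled = select_random(len(pool), k=2)
```

In a real pipeline the vectors would come from some encoder of the question or image; that choice is left open here.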

Abstract

Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches. But they overlook an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the…
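
The two input routes described in the abstract can be contrasted in a small sketch. This is a toy illustration and not the authors' implementation: the function names, the prompt template, and the dimensions are all assumptions, and a single linear map stands in for whatever mapping network a real system would train.

```python
from typing import List, Sequence, Tuple

# Hypothetical few-shot example: (caption, question, answer).
Shot = Tuple[str, str, str]


def build_caption_prompt(shots: Sequence[Shot], caption: str, question: str) -> str:
    """Approach 1: the image is replaced by a text caption inside the prompt."""
    blocks = [f"Caption: {c}\nQuestion: {q}\nAnswer: {a}" for c, q, a in shots]
    blocks.append(f"Caption: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def map_image_embedding(
    image_features: Sequence[float], weights: Sequence[Sequence[float]]
) -> List[float]:
    """Approach 2: project image features into the LLM embedding space.

    A plain matrix-vector product is assumed here purely for illustration.
    """
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def build_embedding_input(
    mapped_image: List[float], question_embeddings: List[List[float]]
) -> List[List[float]]:
    """Prepend the mapped image vector to the question token embeddings."""
    return [mapped_image] + question_embeddings


# Captioning route: one in-context example plus the query.
prompt = build_caption_prompt(
    shots=[("a dog on a beach", "What animal is shown?", "dog")],
    caption="two cats on a sofa",
    question="How many cats are there?",
)

# Embedding route: 2-d image features mapped into a 3-d "LLM" space.
mapped = map_image_embedding([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sequence = build_embedding_input(mapped, [[0.0, 0.0, 0.0]])
```

Passing an empty `shots` list yields the zero-shot form of the caption prompt, which contains only the query block.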

Taxonomy

Topics: Hydrocarbon exploration and reservoir analysis · Medical Imaging Techniques and Applications · Advanced X-ray and CT Imaging

Methods: Flan-T5