Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches
Igor Sterner, Weizhe Lin, Jinghong Chen, Bill Byrne

TL;DR
This paper compares two main methods of integrating images into large language models for few-shot visual question answering. It finds that direct embedding mapping does not reliably beat captioning: captions work better in the zero-shot setting, and in few-shot settings the outcome depends on how in-context examples are selected.
Contribution
The study provides a controlled comparison between captioning and embedding mapping approaches for few-shot VQA with LLMs, highlighting the impact of in-context example selection.
Findings
Captioning outperforms embedding mapping in zero-shot VQA.
For Flan-T5 XL (3B parameters), embedding mapping does not guarantee better performance than captioning.
In few-shot regimes, in-context example selection influences which approach performs better.
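To make "in-context example selection" concrete, here is a minimal sketch of two ways to pick few-shot examples from a pool. The summary above does not say which strategies the paper compares, so both functions (`select_random`, `select_by_similarity`) and the use of cosine similarity over precomputed vectors are illustrative assumptions, not the authors' method.

```python
import numpy as np

def select_random(pool_size, k, seed=0):
    """Pick k distinct pool indices uniformly at random (hypothetical baseline)."""
    rng = np.random.default_rng(seed)
    return rng.choice(pool_size, size=k, replace=False).tolist()

def select_by_similarity(query_vec, pool_vecs, k):
    """Pick the k pool indices whose vectors are most cosine-similar to the query.

    query_vec: shape (d,), an embedding of the test question or image (assumed).
    pool_vecs: shape (n, d), embeddings of the candidate in-context examples.
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per candidate
    return np.argsort(-scores)[:k].tolist()

# Toy pool of four 2-d vectors standing in for real embeddings.
pool = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
nearest = select_by_similarity(query, pool, k=2)
random_pick = select_random(len(pool), k=2)
```

Either list of indices would then determine which solved examples precede the test question in the prompt.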
Abstract
Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches. But they overlook an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the…
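The two input routes described in the abstract can be sketched side by side. This is a toy illustration under stated assumptions: the prompt template, the function names, the single linear projection, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_caption_prompt(shots, query_caption, query_question):
    """Approach 1: each image enters the LLM as a text caption.

    shots: list of (caption, question, answer) in-context examples.
    The "Context/Question/Answer" template is an assumed format.
    """
    blocks = [
        f"Context: {caption}\nQuestion: {question}\nAnswer: {answer}"
        for caption, question, answer in shots
    ]
    blocks.append(f"Context: {query_caption}\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(blocks)

def map_image_features(image_features, weight, bias):
    """Approach 2: project image features into the LLM embedding space.

    A single linear layer is assumed here; real mapping networks may differ.
    image_features: (num_visual_tokens, d_image); weight: (d_image, d_llm).
    """
    return image_features @ weight + bias

def build_embedding_input(mapped_image, text_embeddings):
    """Prepend mapped visual tokens to the text token embeddings."""
    return np.concatenate([mapped_image, text_embeddings], axis=0)

# Captioning route: one solved example followed by the test question.
prompt = build_caption_prompt(
    shots=[("a dog running on a beach", "What animal is shown?", "dog")],
    query_caption="a red bus parked on a street",
    query_question="What color is the bus?",
)

# Embedding route with toy sizes: 4 visual tokens, d_image=8, d_llm=16.
rng = np.random.default_rng(0)
image_features = rng.normal(size=(4, 8))
weight, bias = rng.normal(size=(8, 16)), np.zeros(16)
text_embeddings = rng.normal(size=(5, 16))
llm_input = build_embedding_input(
    map_image_features(image_features, weight, bias), text_embeddings
)
```

In the first route the frozen LLM only ever sees text; in the second it receives a sequence of vectors whose leading rows come from the image rather than from its own token embedding table.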
Taxonomy
Topics: Hydrocarbon exploration and reservoir analysis · Medical Imaging Techniques and Applications · Advanced X-ray and CT Imaging
Methods: Flan-T5
