How (not) to ensemble LVLMs for VQA

Lisa Alazraki; Lluis Castrejon; Mostafa Dehghani; Fantine Huot; Jasper Uijlings; Thomas Mensink

arXiv:2310.06641 · cs.CV · December 8, 2023

Open Access

TL;DR

This paper critically examines the effectiveness of ensembling diverse Large Vision-Language Models for Visual Question Answering, revealing that practical gains are limited despite theoretical potential.

Contribution

It provides an empirical analysis showing that ensembling LVLMs offers limited real-world improvements despite high theoretical gains.

Findings

01

In an oracle experiment, ensembling raises accuracy from 48.8% (the best single model) to as much as 67%.

02

Practical ensemble gains are much smaller than oracle estimates.

03

Ensembling strategies need careful consideration to be effective.

Abstract

This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
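The gap between an oracle ensemble and a practical one can be illustrated with a minimal sketch. The code below is not the authors' evaluation: the function names, the toy predictions, and the use of exact string match as the correctness criterion are all assumptions made for illustration. The oracle counts a question as solved if any model answers it correctly, while majority voting (one simple practical strategy, chosen here only as an example) must commit to a single answer without seeing the ground truth.

```python
from collections import Counter

def single_accuracy(model_preds, answers):
    """Accuracy of one model under exact string match (an assumed metric)."""
    return sum(p == a for p, a in zip(model_preds, answers)) / len(answers)

def oracle_accuracy(predictions, answers):
    """Upper bound: a question counts if at least one model is correct."""
    hits = sum(
        any(model_preds[i] == gold for model_preds in predictions)
        for i, gold in enumerate(answers)
    )
    return hits / len(answers)

def majority_vote_accuracy(predictions, answers):
    """Practical baseline: pick the most frequent answer per question.

    Ties are broken in favour of the earliest model in the list,
    because Counter.most_common keeps insertion order among equal counts.
    """
    hits = 0
    for i, gold in enumerate(answers):
        votes = Counter(model_preds[i] for model_preds in predictions)
        top_answer, _ = votes.most_common(1)[0]
        hits += top_answer == gold
    return hits / len(answers)

# Hypothetical toy data: three models, four questions.
answers = ["paris", "1889", "oak", "blue"]
predictions = [
    ["paris", "1900", "pine", "red"],    # model A
    ["rome",  "1889", "pine", "green"],  # model B
    ["paris", "1887", "oak",  "red"],    # model C
]

best_single = max(single_accuracy(p, answers) for p in predictions)
oracle = oracle_accuracy(predictions, answers)
voted = majority_vote_accuracy(predictions, answers)
print(best_single, oracle, voted)
```

On this toy data the oracle is well above the best single model, yet majority voting falls below it, because the models that happen to be right on a given question are outvoted. The numbers here are constructed for illustration and say nothing about the paper's actual results.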

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling