How (not) to ensemble LVLMs for VQA
Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink

TL;DR
This paper critically examines the effectiveness of ensembling diverse Large Vision-Language Models for Visual Question Answering, revealing that practical gains are limited despite theoretical potential.
Contribution
It provides an empirical analysis showing that ensembling LVLMs offers limited real-world improvements despite high theoretical gains.
Findings
An oracle ensemble shows a large theoretical gain: accuracy rises from 48.8% (best single model) to 67% (best possible ensemble).
Practical ensemble gains are much smaller than oracle estimates.
Ensembling strategies need careful consideration to be effective.
Abstract
This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
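To make the gap between the oracle and a real ensemble concrete, here is a minimal Python sketch on toy data. It is not the authors' code or evaluation protocol: the function names, the exact-match scoring, the majority-vote combiner, and the toy predictions are all assumptions chosen for illustration. The oracle counts a question as solved if any model answers it correctly, whereas a practical ensemble must commit to one answer without seeing the ground truth.

```python
from collections import Counter

def best_single_accuracy(predictions, answers):
    """Accuracy of the best individual model (exact match, an assumption)."""
    n_models = len(predictions[0])
    return max(
        sum(preds[m] == a for preds, a in zip(predictions, answers)) / len(answers)
        for m in range(n_models)
    )

def oracle_accuracy(predictions, answers):
    """Upper bound: a question counts if at least one model is correct."""
    hits = sum(any(p == a for p in preds) for preds, a in zip(predictions, answers))
    return hits / len(answers)

def majority_vote_accuracy(predictions, answers):
    """A simple practical ensemble: pick the most frequent answer.
    Ties fall to the answer seen first (Counter keeps insertion order)."""
    hits = 0
    for preds, a in zip(predictions, answers):
        top_answer, _ = Counter(preds).most_common(1)[0]
        hits += top_answer == a
    return hits / len(answers)

# Hypothetical toy data: one row per question, one column per model.
answers = ["paris", "1889", "oak", "blue"]
predictions = [
    ["paris", "lyon", "paris"],  # majority is correct
    ["1887", "1889", "1890"],    # only one model is correct, no majority
    ["elm", "elm", "oak"],       # the correct model is outvoted
    ["red", "green", "red"],     # no model is correct
]

print(best_single_accuracy(predictions, answers))
print(oracle_accuracy(predictions, answers))
print(majority_vote_accuracy(predictions, answers))
```

On this toy set the oracle is well above the best single model, while majority voting falls below it, because the correct answer is often held by a minority of models. This mirrors, in miniature, why a high oracle score does not by itself guarantee a strong practical ensemble.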
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
