Revisiting Multi-Modal LLM Evaluation
Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan

TL;DR
This paper critically evaluates recent multi-modal large language models using datasets designed to overcome earlier biases, revealing new weaknesses and providing a framework for future assessments.
Contribution
The paper comprehensively evaluates recent MLLMs on improved datasets, highlights previously unreported weaknesses, and integrates the evaluation into the LAVIS framework.
Findings
Revealed new weaknesses in recent MLLMs
Demonstrated limitations of existing datasets for fine-grained analysis
Provided a framework for future MLLM evaluation
Abstract
With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions…
