Revisiting Multi-Modal LLM Evaluation

Jian Lu; Shikhar Srivastava; Junyu Chen; Robik Shrestha; Manoj Acharya; Kushal Kafle; Christopher Kanan

arXiv:2408.05334·cs.AI·August 13, 2024

TL;DR

This paper critically evaluates recent multi-modal large language models using datasets designed to overcome earlier biases, revealing new weaknesses and providing a framework for future assessments.

Contribution

The paper evaluates recent MLLMs on improved datasets, highlights previously unreported weaknesses, and integrates the evaluation into the LAVIS framework.

Findings

01 Revealed new weaknesses in recent MLLMs
02 Demonstrated limitations of existing datasets for fine-grained analysis
03 Provided a framework for future MLLM evaluation

Abstract

With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions…
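To illustrate why per-question-type analysis of the kind TDIUC permits can matter, here is a minimal Python sketch that compares overall accuracy with an unweighted mean over question types. It is not the paper's evaluation code: the exact-match scoring rule, the function names, and the toy records (including the type names "color" and "counting") are all assumptions made for illustration.

```python
from collections import defaultdict


def per_type_accuracy(records):
    """Return accuracy per question type.

    records: iterable of (question_type, predicted, gold) string tuples.
    Scoring is case-insensitive exact match (an assumption, not
    necessarily the rule used in the paper).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, pred, gold in records:
        total[qtype] += 1
        if pred.strip().lower() == gold.strip().lower():
            correct[qtype] += 1
    return {t: correct[t] / total[t] for t in total}


def overall_accuracy(records):
    """Fraction of all questions answered correctly, ignoring type."""
    records = list(records)
    hits = sum(p.strip().lower() == g.strip().lower() for _, p, g in records)
    return hits / len(records)


def mean_per_type(acc_by_type):
    """Unweighted mean of the per-type accuracies."""
    vals = list(acc_by_type.values())
    return sum(vals) / len(vals)


# Hypothetical, imbalanced toy data: 8 easy questions of one type,
# 2 hard questions of another. Not drawn from any real dataset.
toy = [("color", "red", "red")] * 8 + [("counting", "3", "4")] * 2

acc = per_type_accuracy(toy)
overall = overall_accuracy(toy)
print(acc, overall, mean_per_type(acc))
```

On this toy data the overall accuracy is dominated by the frequent type, while the per-type view exposes the failure on the rare type. That gap is the sort of effect an aggregate score can hide.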
