LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Peng Xu; Wenqi Shao; Kaipeng Zhang; Peng Gao; Shuo Liu; Meng Lei; Fanqing Meng; Siyuan Huang; Yu Qiao; Ping Luo

arXiv:2306.09265·cs.CV·June 16, 2023·20 cites

Open Access·1 Repo

TL;DR

This paper introduces LVLM-eHub, a comprehensive benchmark for evaluating large vision-language models across multiple capabilities and scenarios. The evaluation reveals overfitting and hallucination issues, and the paper proposes ways to assess these models more reliably.

Contribution

It presents a new holistic evaluation framework and benchmark for LVLMs, including diverse tests and an online arena, to better understand their capabilities and limitations.

Findings

1. Instruction-tuned LVLMs overfit tasks and generalize poorly.
2. Moderate instruction data can cause object hallucination issues.
3. Multi-turn reasoning evaluation helps mitigate hallucination problems.

Abstract

Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite this success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM Evaluation Hub (LVLM-eHub). LVLM-eHub consists of 8 representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates 6 categories of multimodal capabilities, such as visual question answering and embodied artificial intelligence, on 47 standard text-related visual benchmarks, while the latter provides user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several innovative findings. First, instruction-tuned LVLM with…

Code & Models

Repositories

opengvlab/multi-modality-arena (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications