TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models

Wenqi Shao; Meng Lei; Yutao Hu; Peng Gao; Kaipeng Zhang; Fanqing Meng; Peng Xu; Siyuan Huang; Hongsheng Li; Yu Qiao; Ping Luo

arXiv:2308.03729·cs.CV·August 13, 2024·2 cites

TL;DR

This paper introduces Tiny LVLM-eHub, a lightweight variant of LVLM-eHub for evaluating large vision-language models (LVLMs), with a particular focus on Google's Bard. It assesses six categories of multimodal capability and reports evaluation results that align more closely with human judgment.

Contribution

It presents a systematic evaluation of LVLMs across six categories of multimodal capability using 42 benchmarks, together with the ChatGPT Ensemble Evaluation, which improves how well automatic scores align with human judgment.

Findings

01. Bard outperforms previous LVLMs in most capabilities

02. Tiny LVLM-eHub evaluates six multimodal categories using 42 benchmarks

03. The ChatGPT Ensemble Evaluation improves alignment with human judgment
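
The listing does not say how the ChatGPT Ensemble Evaluation combines its judgments. As a rough illustration of the general idea of ensembling several judges, the sketch below assumes a simple majority vote over independent correct/incorrect verdicts. Everything in it is hypothetical: the `Judge` type, `majority_vote`, `ensemble_judge`, and the three toy string-matching judges stand in for what would be separate ChatGPT prompts in practice, and none of it is taken from the paper's code.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (reference answer, model prediction) -> verdict.
Judge = Callable[[str, str], bool]


def majority_vote(verdicts: Sequence[bool]) -> bool:
    """Return True only if strictly more than half of the verdicts are True."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return 2 * sum(verdicts) > len(verdicts)


def ensemble_judge(judges: Sequence[Judge], reference: str, prediction: str) -> bool:
    """Ask every judge independently, then combine the verdicts by majority vote."""
    return majority_vote([judge(reference, prediction) for judge in judges])


# Toy stand-ins for differently prompted LLM judges (assumption, for illustration).
def exact(reference: str, prediction: str) -> bool:
    return reference == prediction


def case_insensitive(reference: str, prediction: str) -> bool:
    return reference.lower() == prediction.strip().lower()


def contains(reference: str, prediction: str) -> bool:
    return reference.lower() in prediction.lower()


TOY_JUDGES = [exact, case_insensitive, contains]

if __name__ == "__main__":
    for prediction in ["Cat", "a cat on a mat", "A dog"]:
        print(prediction, "->", ensemble_judge(TOY_JUDGES, "cat", prediction))
```

A majority vote is only one possible combination rule; it is chosen here because a single overly lenient or overly strict judge cannot decide the outcome on its own.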

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress in tackling complex multimodal tasks. Among these cutting-edge developments, Google's Bard stands out for its remarkable multimodal capabilities, promoting comprehensive comprehension and reasoning across various domains. This work presents an early and holistic evaluation of LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the vanilla version, Tiny LVLM-eHub possesses several appealing properties. Firstly, it provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence, through quantitative evaluation of 42 standard…

Code & Models

Repositories

opengvlab/multi-modality-arena (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Subtitles and Audiovisual Media

Methods: Focus