TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
Wenqi Shao, Meng Lei, Yutao Hu, Peng Gao, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

TL;DR
This paper introduces Tiny LVLM-eHub, a lightweight yet comprehensive evaluation framework for large vision-language models (LVLMs) such as Google's Bard. It assesses multiple multimodal capabilities and aims to make evaluation both more accurate and easier to run.
Contribution
It presents a systematic, multi-capability evaluation method for LVLMs, comprising a new lightweight benchmark and an analysis approach intended to make evaluation more robust and practical.
Findings
Bard outperforms previous LVLMs in most capabilities
Tiny LVLM-eHub evaluates six multimodal categories using 42 benchmarks
The ChatGPT Ensemble Evaluation improves alignment with human judgment
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress in tackling complex multimodal tasks. Among these cutting-edge developments, Google's Bard stands out for its remarkable multimodal capabilities, promoting comprehensive comprehension and reasoning across various domains. This work presents an early and holistic evaluation of LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the vanilla version, Tiny LVLM-eHub possesses several appealing properties. Firstly, it provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence, through quantitative evaluation of standard…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Subtitles and Audiovisual Media
Methods: Focus
