A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang, Jingyi Zhang

TL;DR
This paper systematically reviews evaluation methods for Multimodal Large Language Models, covering tasks, benchmarks, metrics, and insights to guide future research and development in assessing their capabilities.
Contribution
It provides a comprehensive categorization and analysis of existing MLLM evaluation approaches, highlighting key aspects and challenges in the field.
Findings
Classification of evaluation tasks by capabilities
Summary of general and specific benchmarks
Insights into evaluation metrics and methodologies
Abstract
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the "brain" and the modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate" that reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including…
Taxonomy
Topics: Topic Modeling
