A Survey on Evaluation of Multimodal Large Language Models

Jiaxing Huang; Jingyi Zhang

arXiv:2408.15769·cs.CV·August 29, 2024·3 cites

TL;DR

This paper systematically reviews evaluation methods for Multimodal Large Language Models, covering tasks, benchmarks, metrics, and insights to guide future research and development in assessing their capabilities.

Contribution

It provides a comprehensive categorization and analysis of existing MLLM evaluation approaches, highlighting key aspects and challenges in the field.

Findings

01 Classification of evaluation tasks by capabilities

02 Summary of general and specific benchmarks

03 Insights into evaluation metrics and methodologies

Abstract

Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning the LLM as the "brain" and the modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate", which reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including…

Taxonomy

Topics: Topic Modeling