Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens

Feng Chen; Chenhui Gou; Jing Liu; Yang Yang; Zhaoyang Li; Jiyuan Zhang; Zhenbang Sun; Bohan Zhuang; Qi Wu

arXiv:2411.14725 · cs.CV · June 4, 2025

TL;DR

This paper introduces AbilityLens, a unified benchmark for evaluating vision perception abilities of multimodal large language models, revealing strengths, weaknesses, and training phenomena to guide future development.

Contribution

The paper presents AbilityLens, a comprehensive and robust benchmark for perception abilities in MLLMs, addressing evaluation variance and providing insights into model performance and training dynamics.

Findings

01. Identifies performance gaps between open-source and closed-source MLLMs.

02. Reveals stability patterns and ability conflicts during training.

03. Suggests fine-tuning and model merging as strategies to mitigate ability conflicts.

Abstract

As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential for guiding their development. In this work, we focus on a unified and robust evaluation of vision perception abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, which makes a comprehensive assessment of perception abilities difficult when relying on any single benchmark. To address this, we introduce AbilityLens, a unified benchmark designed to evaluate MLLMs on six key perception abilities (ranging from counting and OCR to understanding structural data), focusing on both accuracy and stability, with each ability encompassing diverse types of questions, domains, and metrics. With the assistance of…

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling

Methods: Focus