LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Ruilin Yao; Bo Zhang; Jirui Huang; Xinwei Long; Yifang Zhang; Tianyu Zou; Yufei Wu; Shichao Su; Yifan Xu; Wenxi Zeng; Zhaoyu Yang; Guoyou Li; Shilan Zhang; Zichan Li; Yaxiong Chen; Shengwu Xiong; Peng Xu; Jiajun Zhang; Bowen Zhou; David Clifton; Luc Van Gool

arXiv:2505.15616·cs.CV·May 14, 2026

TL;DR

Lens is a multi-level benchmark with rich annotations for evaluating multimodal large language models (MLLMs) on perception, understanding, and reasoning tasks. It highlights the models' current limitations in complex real-world scenarios.

Contribution

The paper presents a new benchmark dataset with tiered tasks and rich per-image annotations, designed to better evaluate the reasoning capabilities of MLLMs across diverse scenarios.

Findings

01. None of the evaluated models exceeds 60% accuracy on reasoning tasks.

02. The dataset covers 8 tasks and 12 daily scenarios, organized into 3 progressive tiers (perception, understanding, and reasoning).

03. Models released after Dec. 2024 perform poorly on complex reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex, real-world scenarios remains limited. Existing benchmarks are usually constructed in a task-oriented manner, with no guarantee that samples for different tasks come from the same data distribution; as a result, they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To address this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers: perception, understanding, and reasoning. A key feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports to evaluate…
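
Because every image carries questions from all three tiers, results can be reported per tier and also conditioned across tiers on the same image. The following is a minimal Python sketch of that idea, not the authors' evaluation code. It assumes a hypothetical flat record layout (`image_id`, `tier`, `correct`) that is not taken from the Lens repository, and the toy records are invented for illustration only; they are not results from the paper.

```python
from collections import defaultdict

# Tier names come from the abstract; the record layout below is an assumption.
TIERS = ("perception", "understanding", "reasoning")


def tier_accuracy(records):
    """Return {tier: fraction of correct answers} for tiers that have records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["tier"]] += 1
        hits[r["tier"]] += int(r["correct"])
    return {t: hits[t] / totals[t] for t in TIERS if totals[t]}


def reasoning_given_perception(records):
    """Reasoning accuracy restricted to images whose perception
    questions were all answered correctly. Returns None if no such
    reasoning questions exist."""
    perception_ok = {}
    for r in records:
        if r["tier"] == "perception":
            prev = perception_ok.get(r["image_id"], True)
            perception_ok[r["image_id"]] = prev and r["correct"]
    kept = [
        r for r in records
        if r["tier"] == "reasoning" and perception_ok.get(r["image_id"], False)
    ]
    if not kept:
        return None
    return sum(int(r["correct"]) for r in kept) / len(kept)


# Invented toy data (hypothetical), one dict per answered question.
toy_records = [
    {"image_id": "img1", "tier": "perception", "correct": True},
    {"image_id": "img1", "tier": "understanding", "correct": True},
    {"image_id": "img1", "tier": "reasoning", "correct": True},
    {"image_id": "img2", "tier": "perception", "correct": False},
    {"image_id": "img2", "tier": "understanding", "correct": True},
    {"image_id": "img2", "tier": "reasoning", "correct": False},
    {"image_id": "img3", "tier": "perception", "correct": True},
    {"image_id": "img3", "tier": "reasoning", "correct": False},
]

per_tier = tier_accuracy(toy_records)
conditional = reasoning_given_perception(toy_records)
```

Comparing `per_tier["reasoning"]` with `conditional` gives a rough view of how much reasoning errors coincide with perception errors on the same image, which is the kind of cross-tier analysis a shared image pool makes possible.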

Code & Models

Repositories

Lens4MLLMs/lens (GitHub)

Videos

LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models (SlidesLive)