HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks

Fengji Zhang; Linquan Wu; Huiyu Bai; Guancheng Lin; Xiao Li; Xiao Yu; Yue Wang; Bei Chen; Jacky Keung

arXiv:2410.12381 · cs.CV · February 19, 2025

TL;DR

HumanEval-V is a benchmark that evaluates how well large multimodal models (LMMs) interpret and reason over complex diagrams in coding tasks. Results across 22 LMMs show that even the strongest models solve only a minority of tasks on the first attempt.

Contribution

We present HumanEval-V, a novel benchmark with human-annotated diagram-based coding tasks to assess visual reasoning in LMMs, filling a gap in existing evaluation frameworks.

Findings

01 The top-performing model, Claude 3.5 Sonnet, reaches only 36.8% pass@1

02 Models struggle with spatial, topological, and dynamic reasoning

03 Current LMMs show significant room for improvement in diagram understanding

Abstract

Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in coding contexts. We present HumanEval-V, a rigorous benchmark of human-annotated coding tasks that spans six task types and evaluates diverse visual reasoning capabilities. Each task features carefully crafted diagrams paired with function signatures and test cases, employing novel code generation tasks to thoroughly assess models' diagram comprehension. Through extensive experiments with 22 LMMs, we find that even top-performing models achieve modest success rates, with Claude 3.5 Sonnet reaching only 36.8% pass@1, highlighting substantial room for improvement. Our analysis…
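
The abstract reports results as pass@1 but does not spell out how the metric is computed. As a rough illustration, the sketch below uses the unbiased pass@k estimator popularized by the original HumanEval benchmark; it is an assumption that HumanEval-V scores solutions the same way. The names `pass_at_k` and `benchmark_pass_at_k` are hypothetical helpers, not part of the HumanEval-V codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of solutions sampled for the task
    c: number of those solutions that pass every test case
    k: attempt budget being scored (k=1 gives pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n, c) for one task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for three tasks, purely to show the call shape.
    toy_results = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {benchmark_pass_at_k(toy_results, k=1):.3f}")
```

With a single sample per task (n = 1, k = 1), the estimator reduces to the fraction of tasks whose one generated solution passes all tests.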

Code & Models

Repositories

HumanEval-V/HumanEval-V-Benchmark
Official

Models

iSolver-AI/FEnet
model · 54 downloads

Datasets

HumanEval-V/HumanEval-V-Benchmark
dataset · 177 downloads

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning