HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung

TL;DR
HumanEval-V introduces a comprehensive benchmark to evaluate large multimodal models' ability to interpret and reason over complex diagrams in coding tasks, revealing current models' limitations and guiding future improvements.
Contribution
We present HumanEval-V, a novel benchmark with human-annotated diagram-based coding tasks to assess visual reasoning in LMMs, filling a gap in existing evaluation frameworks.
Findings
The top-performing model, Claude 3.5 Sonnet, achieves only 36.8% pass@1
Models struggle with spatial, topological, and dynamic reasoning
Current LMMs show significant room for improvement in diagram understanding
Abstract
Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in coding contexts. We present HumanEval-V, a rigorous benchmark of human-annotated coding tasks that spans six task types and evaluates diverse visual reasoning capabilities. Each task features carefully crafted diagrams paired with function signatures and test cases, employing novel code generation tasks to thoroughly assess models' diagram comprehension. Through extensive experiments with 22 LMMs, we find that even top-performing models achieve modest success rates, with Claude 3.5 Sonnet reaching only 36.8% pass@1, highlighting substantial room for improvement. Our analysis…
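The pass@1 figure quoted above can be made concrete with a small sketch. The summary does not say how HumanEval-V computes the metric, so the code below assumes the commonly used unbiased pass@k estimator introduced with the original HumanEval benchmark: for a task with n sampled solutions of which c pass all test cases, pass@k = 1 - C(n-c, k) / C(n, k). The function names pass_at_k and benchmark_pass_at_k are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed estimator, not confirmed by the source).

    n: number of solutions sampled for the task
    c: number of those solutions that pass every test case
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: any draw of k must contain a passing one.
    if n - c < k:
        return 1.0
    # Probability that k samples drawn without replacement all fail,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per task, so a benchmark-level pass@1 is simply the mean fraction of sampled solutions that pass their tests.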
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
