Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities
Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu

TL;DR
Code-Vision is a comprehensive benchmark for assessing the logical understanding and code generation abilities of multimodal large language models, revealing significant performance gaps between proprietary and open-source models.
Contribution
The paper introduces Code-Vision, a new benchmark in which MLLMs must generate a correct program from a flowchart. Its three subsets (HumanEval-V, Algorithm, and MATH) cover basic programming, algorithmic, and mathematical problem-solving.
Findings
GPT-4o achieves 79.3% pass@1 on Hard problems.
Open-source models lag far behind proprietary ones: the best open-source model reaches only 15% pass@1 on Hard problems.
Code-Vision presents unique challenges compared to existing benchmarks.
Abstract
This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate that there is a large performance difference between proprietary and open-source models. On Hard problems, GPT-4o can achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further experiments reveal that Code-Vision can pose unique challenges…
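The pass@1 figures quoted above can be read with the help of a small sketch. The summary does not say how the authors compute pass@1, so the code below is an assumption: it uses the widely used unbiased pass@k estimator, where n programs are sampled per problem and c of them pass all tests. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, chosen for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled programs, c of which pass.

    Assumed estimator (not confirmed by the source): 1 - C(n - c, k) / C(n, k),
    i.e. the probability that at least one of k samples drawn without
    replacement from the n generated programs is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimates over a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for four problems, one sample each (n = 1).
    outcomes = [(1, 1), (1, 0), (1, 1), (1, 1)]
    print(f"pass@1 = {benchmark_pass_at_k(outcomes, k=1):.1%}")
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose generated program passes every test case, which is how a score such as 79.3% on Hard problems would be interpreted under this assumption.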
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
