Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

Hanbin Wang; Xiaoxuan Zhou; Zhipeng Xu; Keyuan Cheng; Yuxin Zuo; Kai Tian; Jingwei Song; Junting Lu; Wenhui Hu; Xueyang Liu

arXiv:2502.11829·cs.CL·February 18, 2025

TL;DR

Code-Vision is a comprehensive benchmark for assessing the logical understanding and code generation abilities of multimodal large language models, revealing significant performance gaps between proprietary and open-source models.

Contribution

The paper introduces Code-Vision, a new benchmark with diverse tasks to evaluate MLLMs' coding and reasoning skills across multiple domains.

Findings

01. GPT-4o achieves 79.3% pass@1 on Hard problems.

02. Open-source models lag far behind proprietary models: the best open-source model reaches only 15% pass@1 on Hard problems.

03. Code-Vision poses unique challenges compared with existing benchmarks.

Abstract

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate that there is a large performance difference between proprietary and open-source models. On Hard problems, GPT-4o can achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further experiments reveal that Code-Vision can pose unique challenges…
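The abstract reports results as pass@1, but this page does not say how the metric is computed. As a minimal sketch, assuming the unbiased pass@k estimator commonly used for HumanEval-style benchmarks (draw n samples per problem, count the c that pass all tests, and average over problems), the calculation might look like the following. The function names and the toy data are hypothetical and are not taken from the Code-Vision repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of programs sampled for the problem
    c: number of those programs that pass every test case
    k: budget of attempts being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct program.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data (not from the paper): four problems,
# one sample each, three of them solved.
toy_results = [(1, 1), (1, 1), (1, 0), (1, 1)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.1%}")
```

With a single sample per problem, pass@1 under this assumption reduces to the fraction of problems whose generated program passes all of its tests.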

Code & Models

Repositories

wanghanbinpanda/codevision (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems