NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision

Xiang Li; Wenyue Hua; Kaijie Zhu; Lingyao Li; Haoyang Ling; Jinkui Chi; Qi Dou; Jindong Wang; Yongfeng Zhang; Xin Ma; Lizhou Fan

arXiv:2403.01777 · cs.CL · August 28, 2025 · 1 citation

TL;DR

NPHardEval4V introduces a new multimodal benchmark based on NP-hard problems to evaluate the reasoning capabilities of large vision-language models, revealing their limitations in complex combinatorial tasks.

Contribution

This work presents NPHardEval4V, a novel benchmark suite that assesses LVLMs on structured, logic-driven problems combining visual and linguistic reasoning, filling a gap in existing evaluation methods.

Findings

1. Models perform well on perception but struggle with optimization and abstraction.
2. No single model shows consistent reasoning across all NP-hard tasks.
3. Current architectures have fundamental limitations in complex combinatorial reasoning.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal understanding, yet their reasoning abilities remain underexplored. Existing benchmarks tend to focus on perception or text-based comprehension, offering limited insight into how well these models perform on structured, logic-driven tasks that require both visual and linguistic reasoning. To address this gap, we introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems: Knapsack, Set Cover, Traveling Salesperson, and Vertex Cover. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform combinatorial reasoning under visual-linguistic constraints. We evaluate a set of advanced open-source and closed-source vision-language models under a unified prompting and problem…
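To make the kind of task concrete, here is a minimal sketch of how an answer to one of the four problems named above (Knapsack) could be checked. It is an illustration under stated assumptions, not the benchmark's own scoring code: the function names `knapsack_value` and `knapsack_optimum`, the 0/1 formulation with integer weights, and the example instance are all hypothetical.

```python
from typing import Optional, Sequence


def knapsack_value(weights: Sequence[int], values: Sequence[int],
                   capacity: int, chosen: Sequence[int]) -> Optional[int]:
    """Return the total value of the chosen item indices, or None if infeasible.

    Hypothetical checker: assumes a 0/1 knapsack where each item may be
    picked at most once and `chosen` lists item indices.
    """
    chosen = list(chosen)
    if len(set(chosen)) != len(chosen):
        return None  # an item was selected more than once
    if any(i < 0 or i >= len(weights) for i in chosen):
        return None  # index does not refer to a real item
    if sum(weights[i] for i in chosen) > capacity:
        return None  # selection exceeds the weight limit
    return sum(values[i] for i in chosen)


def knapsack_optimum(weights: Sequence[int], values: Sequence[int],
                     capacity: int) -> int:
    """Best achievable value, via the standard dynamic program over capacities."""
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


# Hypothetical instance: three items, weight limit 5.
weights, values, capacity = [2, 3, 4], [3, 4, 5], 5
answer = [0, 1]  # a model's proposed selection (assumed format)
print(knapsack_value(weights, values, capacity, answer),
      knapsack_optimum(weights, values, capacity))
```

Separating feasibility from optimality in this way lets a checker distinguish an answer that violates the constraints from one that is valid but suboptimal.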

Code & Models

Repositories

lizhouf/nphardeval4v (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Focus