NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision
Xiang Li, Wenyue Hua, Kaijie Zhu, Lingyao Li, Haoyang Ling, Jinkui Chi, Qi Dou, Jindong Wang, Yongfeng Zhang, Xin Ma, Lizhou Fan

TL;DR
NPHardEval4V introduces a new multimodal benchmark based on NP-hard problems to evaluate the reasoning capabilities of large vision-language models, revealing their limitations in complex combinatorial tasks.
Contribution
This work presents NPHardEval4V, a benchmark suite that assesses LVLMs on structured, logic-driven problems requiring both visual and linguistic reasoning. It targets a gap left by existing benchmarks, which focus mainly on perception or text-based comprehension.
Findings
Models perform well on perception but struggle with optimization and abstraction.
No single model shows consistent reasoning across all NP-hard tasks.
Current architectures have fundamental limitations in complex combinatorial reasoning.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal understanding, yet their reasoning abilities remain underexplored. Existing benchmarks tend to focus on perception or text-based comprehension, offering limited insight into how well these models perform on structured, logic-driven tasks that require both visual and linguistic reasoning. To address this gap, we introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems: Knapsack, Set Cover, Traveling Salesperson, and Vertex Cover. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform combinatorial reasoning under visual-linguistic constraints. We evaluate a set of advanced open-source and closed-source vision-language models under a unified prompting and problem…
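As a rough illustration of what two of the four task families ask for, the sketch below checks a candidate Vertex Cover answer and computes an optimal 0/1 Knapsack value on tiny instances. It is not the benchmark's evaluation code: the function names, the input formats (edge lists, `(weight, value)` pairs), and the brute-force search are all assumptions chosen for readability.

```python
from itertools import combinations


def is_vertex_cover(edges, chosen):
    """Return True if every edge has at least one endpoint in `chosen`."""
    chosen = set(chosen)
    return all(u in chosen or v in chosen for u, v in edges)


def min_vertex_cover_size(nodes, edges):
    """Brute-force the smallest cover size (exponential; tiny graphs only)."""
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            if is_vertex_cover(edges, subset):
                return k
    return len(nodes)


def knapsack_best_value(items, capacity):
    """0/1 Knapsack by dynamic programming over integer capacities.

    `items` is a list of (weight, value) pairs with integer weights.
    """
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


# Hypothetical toy instances, not taken from the benchmark.
path_nodes = [0, 1, 2, 3]
path_edges = [(0, 1), (1, 2), (2, 3)]
toy_items = [(1, 1), (3, 4), (4, 5), (5, 7)]
```

A model's answer could then be scored in two steps under these assumptions: first check feasibility (for example with `is_vertex_cover`), then compare its cost or value against the optimum from the reference solver.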
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
