Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning

Jianyi Zhang; Xu Ji; Ziyin Zhou; Yuchen Zhou; Shubo Shi; Haoyu Wu; Zhen Li; Shizhao Liu

arXiv:2508.00323 · cs.AI · August 4, 2025

TL;DR

This paper introduces ReasonBench, a comprehensive benchmark for evaluating visual language models on complex graphic reasoning tasks, revealing current limitations and proposing strategies to significantly improve their reasoning capabilities.

Contribution

It presents the first structured benchmark for complex graphic reasoning in VLMs and introduces dual optimization strategies that enhance model performance and interpretability.

Findings

01. VLMs show significant limitations in complex graphic reasoning.

02. The proposed strategies improve VLM performance by 33.5%.

03. ReasonBench covers diverse reasoning dimensions and real-world questions.

Abstract

Evaluating the performance of visual language models (VLMs) in graphic reasoning tasks has become an important research topic. However, VLMs still show obvious deficiencies in simulating human-level graphic reasoning, especially in complex graphic reasoning and abstract problem solving. These areas are less studied, and existing studies focus only on simple graphics. To evaluate the performance of VLMs in complex graphic reasoning, we propose ReasonBench, the first evaluation benchmark focused on structured graphic reasoning tasks, which includes 1,613 questions from real-world intelligence tests. ReasonBench covers reasoning dimensions related to location, attribute, quantity, and multi-element tasks, providing a comprehensive evaluation of VLMs' spatial, relational, and abstract reasoning capabilities. We benchmark 11 mainstream VLMs (including closed-source…
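As a rough illustration of how results on a benchmark organized by reasoning dimension might be tallied, the sketch below computes accuracy per dimension. It is not the authors' evaluation code. The record layout (`dimension`, `answer`, `prediction`), the assumption that questions are multiple-choice with letter answers, and the function name `accuracy_by_dimension` are all hypothetical; only the dimension names come from the abstract.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for a list of answered questions.

    Each record is assumed (hypothetically) to be a dict with the keys
    "dimension", "answer", and "prediction", where the last two are
    option letters such as "A" or "b".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        total[dim] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy records invented for illustration only; these are not ReasonBench data.
sample = [
    {"dimension": "location", "answer": "A", "prediction": "a"},
    {"dimension": "location", "answer": "C", "prediction": "B"},
    {"dimension": "quantity", "answer": "D", "prediction": "D"},
]
scores = accuracy_by_dimension(sample)
```

Reporting one score per dimension, instead of a single overall number, makes it easier to see whether a model struggles with, say, location tasks specifically.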

Code & Models

Datasets

cistine/ReasonBench
dataset · 62 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Spatial Cognition and Navigation