Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems
Qifan Zhang, Jianhao Ruan, Aochuan Chen, Kang Zeng, Nuo Chen, Jing Tang, Jia Li

TL;DR
This paper introduces GrAlgoBench, a new benchmark using graph algorithm problems to evaluate large reasoning models, revealing their weaknesses in handling long contexts and over-verifying solutions.
Contribution
It presents a novel benchmark for reasoning models based on graph algorithms, highlighting key limitations in current models' accuracy and reasoning strategies.
Findings
Accuracy falls below 50% once graphs exceed 120 nodes
Models frequently make execution errors and show weak memory
Over-verification inflates reasoning traces without improving correctness
Abstract
Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common-sense reasoning remain limited. They lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second,…
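The abstract's point that graph problems allow controlled difficulty and programmatic checking can be illustrated with a minimal sketch. It is not the benchmark's code: the shortest-path task, the node counts in `LEVEL_SIZES`, the edge probability, and the `answer_fn` stand-in for a model are all hypothetical choices made for illustration.

```python
import random
from collections import deque


def er_graph(n, p, seed):
    """Sample an undirected Erdos-Renyi graph G(n, p) as an adjacency list."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def shortest_path_length(adj, src, dst):
    """Reference answer via BFS; returns -1 when dst is unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return -1


def is_correct(adj, src, dst, model_answer):
    """Exact-match check of an integer answer against the reference."""
    return model_answer == shortest_path_length(adj, src, dst)


# Hypothetical node counts per difficulty level (not the benchmark's values).
LEVEL_SIZES = {1: 10, 2: 20, 3: 40, 4: 80, 5: 120, 6: 160}


def accuracy_by_level(answer_fn, p=0.1, trials=20):
    """Score answer_fn(adj, src, dst) on seeded random graphs at each level."""
    results = {}
    for level, n in LEVEL_SIZES.items():
        hits = 0
        for t in range(trials):
            adj = er_graph(n, p, seed=1000 * level + t)
            hits += is_correct(adj, 0, n - 1, answer_fn(adj, 0, n - 1))
        results[level] = hits / trials
    return results
```

In an actual evaluation, `answer_fn` would wrap a prompt to a model and parse its final answer; because the reference solver is exact, the check needs no human or LLM judge.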
Peer Reviews
Decision: ICLR 2026 Poster
The nine tasks are precisely specified (with optimal algorithms/complexities in the appendix), and the dataset scales graph sizes systematically (Level-1…Level-6), which is valuable for probing context-length effects in a controlled way.
1. The core taxonomy (Enumeration / Exploration / Intuition) is *empirically* assigned by first generating 100 ER instances per problem, collecting LRM responses, and then asking **Qwen-2.5-72B** to classify which algorithmic family the *responses* reflect; the team then picks 9 tasks whose “algorithms are relatively unambiguous.” This risks circularity (taxonomy depends on current LRM behaviors) and injects judge-model bias into the ground truth of the benchmark’s conceptual framing. It also un
The paper is well-written and clear, and the experiments are designed carefully (e.g., a diverse selection of open- and closed-source frontier models, a strong qualitative study, and error analysis). Moreover, the code is available, which aids in the reproducibility of the work.
* **Task design:** The reasoning taxonomy is well-designed and well-motivated, categorizing reasoning types into enumeration, exploration, and intuition, which appears to be a strong differentiator compared to prior wor
* One minor weakness is that the **fixed assignments of Enumeration, Exploration, and Intuition reasoning labels** to tasks might be slightly misleading, as models might sometimes use different reasoning to solve a task (e.g., doing Exploration instead of Intuition). While Appendix H.1 does a great job showing this issue would be rare, it also shows such an issue exists. Furthermore, these ablations are done by using an LLM-as-judge on 100 problem instances that seem to differ from the wide rang
1. Well-motivated benchmark & taxonomy. The bridge from CLRS-style algorithmic families to the Enumeration / Exploration / Intuition taxonomy is clear and useful for reasoning analysis.
2. Real-world graph sources & scaling. Instances are derived from DBLP, street networks (multiple cities), OpenFlights, Wikipedia, and DBpedia; each task is generated across six scales to modulate difficulty, supporting long-context evaluation without synthetic toy artifacts.
3. Comprehensive evaluation & clear
1. LLM-as-judge dependence. Error categorization and over-thinking judgments rely on LLM pipelines. The paper would benefit from reporting inter-annotator agreement with humans on a stratified subset to quantify labeling reliability and possible bias.
2. On line 72 you note that GraphWalks is intended to benchmark LRMs’ long-context capabilities. Please clarify how your benchmark compares to GraphWalks in scope, task design, evaluation protocol, and key findings. In particular, explain what is
Taxonomy
Topics: Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI) · Graph Theory and Algorithms
