Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems

Qifan Zhang; Jianhao Ruan; Aochuan Chen; Kang Zeng; Nuo Chen; Jing Tang; Jia Li

arXiv:2602.06319·cs.AI·February 9, 2026


Open Access 3 Reviews

TL;DR

This paper introduces GrAlgoBench, a benchmark that uses graph algorithm problems to evaluate large reasoning models. It reveals two weaknesses: accuracy degrades on long contexts, and models over-verify their solutions.

Contribution

It presents a novel benchmark for reasoning models based on graph algorithms, highlighting key limitations in current models' accuracy and reasoning strategies.

Findings

01

Accuracy drops below 50% with graphs over 120 nodes

02

Models frequently make execution errors and have weak memory

03

Over-verification inflates reasoning traces without improving correctness

Abstract

Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common-sense reasoning remain limited. They lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second,…
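To make "standardized, programmatic evaluation" concrete, here is a minimal sketch of how an answer to a graph problem could be checked automatically. Everything in it is an assumption for illustration: the task (unweighted shortest-path length) is not necessarily one of the paper's nine tasks, and `random_graph`, `shortest_path_length`, and `is_correct` are hypothetical names, not part of GrAlgoBench. The node count `n` stands in for the difficulty knob the abstract describes.

```python
import random
from collections import deque


def random_graph(n, p, seed=0):
    """Hypothetical instance generator: undirected graph on n nodes,
    each possible edge included independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def shortest_path_length(adj, source, target):
    """Reference answer via breadth-first search.
    Returns the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None


def is_correct(adj, source, target, model_answer):
    """Exact-match check of a model's parsed answer against the reference."""
    return model_answer == shortest_path_length(adj, source, target)
```

Because the reference answer is computed by a deterministic algorithm, the same check applies unchanged as `n` grows, which is what lets graph size act as a difficulty dial in a setup like this.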

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

The nine tasks are precisely specified (with optimal algorithms/complexities in the appendix), and the dataset scales graph sizes systematically (Level-1…Level-6), which is valuable for probing context-length effects in a controlled way.

Weaknesses

1. The core taxonomy (Enumeration / Exploration / Intuition) is *empirically* assigned by first generating 100 ER instances per problem, collecting LRM responses, and then asking **Qwen-2.5-72B** to classify which algorithmic family the *responses* reflect; the team then picks 9 tasks whose “algorithms are relatively unambiguous.” This risks circularity (taxonomy depends on current LRM behaviors) and injects judge-model bias into the ground truth of the benchmark’s conceptual framing. It also un

Reviewer 02·Rating 8·Confidence 4

Strengths

The paper is well-written and clear, and the experiments are designed carefully (e.g., a diverse selection of open- and closed-source frontier models, a strong qualitative study, and error analysis). Moreover, the code is available, which aids in the reproducibility of the work.

* **Task design:** The reasoning taxonomy is well-designed and well-motivated, categorizing reasoning types into enumeration, exploration, and intuition, which appears to be a strong differentiator compared to prior wor

Weaknesses

* One minor weakness is that the **fixed assignments of Enumeration, Exploration, and Intuition reasoning labels** to tasks might be slightly misleading, as models might sometimes use different reasoning to solve a task (e.g., doing Exploration instead of Intuition). While Appendix H.1 does a great job showing this issue would be rare, it also shows such an issue exists. Furthermore, these ablations are done by using an LLM-as-judge on 100 problem instances that seem to differ from the wide rang

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Well-motivated benchmark & taxonomy. The bridge from CLRS-style algorithmic families to the Enumeration / Exploration / Intuition taxonomy is clear and useful for reasoning analysis.
2. Real-world graph sources & scaling. Instances are derived from DBLP, street networks (multiple cities), OpenFlights, Wikipedia, and DBpedia; each task is generated across six scales to modulate difficulty, supporting long-context evaluation without synthetic toy artifacts.
3. Comprehensive evaluation & clear

Weaknesses

1. LLM-as-judge dependence. Error categorization and over-thinking judgments rely on LLM pipelines. The paper would benefit from reporting inter-annotator agreement with humans on a stratified subset to quantify labeling reliability and possible bias.
2. On line 72 you note that GraphWalks is intended to benchmark LRMs’ long-context capabilities. Please clarify how your benchmark compares to GraphWalks in scope, task design, evaluation protocol, and key findings. In particular, explain what is
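The agreement check requested in weakness 1 could be computed along the following lines. This is only a sketch under stated assumptions: Cohen's kappa is one common agreement statistic chosen here for illustration (the review does not name a metric), and the function name and the example label strings are hypothetical rather than taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's label rates."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical error-category labels from an LLM judge and a human annotator.
judge = ["execution", "execution", "memory", "memory"]
human = ["execution", "memory", "memory", "memory"]
kappa = cohens_kappa(judge, human)
```

A value near 1 would indicate that the LLM judge and the human annotator agree far more often than chance, while a value near 0 would suggest the judge's labels carry little information beyond the label frequencies.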


Taxonomy

Topics: Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI) · Graph Theory and Algorithms