DAG-Math: Graph-of-Thought Guided Mathematical Reasoning in LLMs
Yuanhe Zhang, Ilja Kuzborskij, Jason D. Lee, Chenlei Leng, Fanghui Liu

TL;DR
This paper introduces DAG-MATH, a graph-based framework for evaluating and guiding mathematical reasoning in LLMs through rule-based trajectories, revealing reasoning fidelity gaps beyond traditional accuracy metrics.
Contribution
It proposes modeling Chain-of-Thought as a DAG-based stochastic process, introduces the logical closeness metric, and creates a benchmark for assessing reasoning fidelity in LLMs.
Findings
LLMs with similar accuracy differ significantly in reasoning fidelity.
Traditional metrics may overlook rule-consistent reasoning quality.
The DAG-MATH framework balances free-form reasoning and formal proof evaluation.
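To make the contrast with traditional metrics concrete, here is a minimal sketch of pass@k as it is commonly computed, using the standard unbiased estimator 1 - C(n-c, k) / C(n, k) over n sampled solutions of which c are correct. This summary does not say which PASS@ variant the paper uses, so the function name `pass_at_k` and the choice of estimator are assumptions for illustration. Note that the only input is final-answer correctness; nothing about the intermediate steps enters the score.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the commonly used estimator, not necessarily
    the exact variant used in the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    p_all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_wrong


# Two hypothetical models with the same counts of correct final answers
# receive the same score, regardless of how the answers were derived.
score_a = pass_at_k(n=10, c=3, k=1)
score_b = pass_at_k(n=10, c=3, k=1)
```

Because the score depends only on `n`, `c`, and `k`, two models can tie on it while differing in how rule-consistent their derivations are.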
Abstract
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a certain rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@ metrics. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability…
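The graph view described in the abstract can be sketched in a few lines: steps are nodes, and an edge from a parent to a child means the child was derived from the parent. The check below is an illustrative proxy only. It assumes that a trajectory counts as well-formed when the graph is acyclic and every step other than the final answer feeds into a later step; the paper's formal definition of logical closeness is not given in this summary, and the names `topological_order`, `unused_steps`, and `is_closed` are hypothetical.

```python
from collections import defaultdict, deque


def topological_order(nodes, edges):
    """Return a topological order (Kahn's algorithm), or None if a cycle exists."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order if len(order) == len(nodes) else None


def unused_steps(nodes, edges, final):
    """Steps (other than the final answer) that no later step builds on."""
    has_child = {parent for parent, _ in edges}
    return [n for n in nodes if n != final and n not in has_child]


def is_closed(nodes, edges, final):
    """Assumed proxy: acyclic, and every non-final step is used downstream."""
    if topological_order(nodes, edges) is None:
        return False
    return not unused_steps(nodes, edges, final)


# Hypothetical trajectory: s3 is derived from the premise but never used.
nodes = ["premise", "s1", "s2", "s3", "answer"]
edges = [("premise", "s1"), ("s1", "s2"), ("s2", "answer"), ("premise", "s3")]
dangling = unused_steps(nodes, edges, final="answer")
closed = is_closed(nodes, edges, final="answer")
```

In this toy trajectory the final answer can be correct while `s3` is a dead-end step, which is the kind of gap an accuracy-only metric would not register.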
Peer Reviews
Decision: ICLR 2026 Poster
1. The underlying idea of treating CoT as a DAG traversal is fundamentally sound and offers a pathway for structured reasoning analysis beyond token-level checks. This is the paper's primary and most important strength. 2. The authors have created impressive few-shot prompts to enforce their complex output format, which is a valuable demonstration of structured generation control in LLMs. The visual examples of the DAGs are convincing. 3. The metric correctly isolates failure modes like speculat
1. The PRR/AUC metric confuses adherence to the authors' custom template with true logical reasoning ability. The paper must provide evidence that this metric holds up when applied to non-formatted, naturally generated CoT. 2. A critical omission is the lack of comparison with or contextualization against MCTS or similar graph-based search methods. If the goal is to improve reasoning, how does the DAG-MATH diagnosis inform or relate to these established LLM search strategies? 3. The use of LLMs
The paper tackles a really important problem: understanding whether LLMs achieve correct answers through systematic search or through genuine logical reasoning, which is a fundamental question for the field. The DAG-based formalization is quite a novel approach that sits nicely between completely free-form CoT and very formal systems like LEAN verification, making it more practical to use. The logical closeness metric gives us insights that go beyond the simple PASS@k metrics that everyone uses. The
There is a concerning circularity in how the benchmark was constructed: using GPT-4 and Qwen to create the "gold standard" DAGs means the benchmark is essentially created by the same type of models that are being evaluated, which introduces obvious biases. The theoretical justification feels underdeveloped. Why should we believe this specific DAG formalization captures what "true" reasoning means? The stochastic process described in Equation 1 seems a somewhat arbitrary choice without proper just
- The paper is clearly written and well-organized. - The idea of representing CoT reasoning with DAG-MATH is interesting and novel. The proposed metrics are also new and conceptually sound. - The empirical results are informative, showing how graph structures reflect problem difficulty and reasoning quality.
- Enforcing the DAG-MATH format may degrade the natural reasoning flexibility of LLMs. It would help to include an analysis or ablation comparing performance with and without this formatting constraint. Furthermore, if the few-shot examples are drawn from a specific model family, models of the same family might have an advantage because their reasoning patterns are similar. - The analysis is primarily quantitative. Some qualitative examples or case studies of the generated DAG-MATH graphs, espec
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Mathematics, Computing, and Information Processing
