Clarifying Before Reasoning: A Coq Prover with Structural Context

Yanzhen Lu; Hanbin Yang; Xiaodie Wang; Ge Zhang; Biao Li; Chenxu Fu; Chao Li; Yang Yuan; Andrew Chi-Chih Yao

arXiv:2507.02541·cs.AI·July 4, 2025

Clarifying Before Reasoning: A Coq Prover with Structural Context

Yanzhen Lu, Hanbin Yang, Xiaodie Wang, Ge Zhang, Biao Li, Chenxu Fu, Chao Li, Yang Yuan, Andrew Chi-Chih Yao

PDF

3 Reviews

TL;DR

This paper demonstrates that adding structured semantic context to theorem proving tasks in Coq significantly improves the reasoning success of large language models, highlighting the importance of task clarity.

Contribution

Introducing a concept-level metric and structured semantic context to enhance LLM reasoning in Coq theorem proving, achieving substantial performance improvements.

Findings

01

44.5% to 82.3% clarity score increase

02

Proof success rate improved from 21.8% to 45.8%

03

Structured data fine-tuning yields 48.6% success rate

Abstract

In this work, we investigate whether improving task clarity can enhance reasoning ability of large language models, focusing on theorem proving in Coq. We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context to the standard input used by modern LLMs, leads to a 1.85 $\times$ improvement in clarity score (44.5\%~ $\to$ ~82.3\%). Using the general-purpose model \texttt{DeepSeek-V3}, our approach leads to a 2.1 $\times$ improvement in proof success (21.8\%~ $\to$ ~45.8\%) and outperforms the previous state-of-the-art \texttt{Graph2Tac} (33.2\%). We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages, following the same evaluation protocol as \texttt{Graph2Tac}. Furthermore, fine-tuning smaller models on our structured data can achieve even higher performance (48.6\%). Our method uses selective…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 8Confidence 5

Strengths

This was a refreshing paper to read, with genuinely novel approaches, and well thought out experiments. The core idea of separating and measuring "task clarity" as a distinct bottleneck from "reasoning" is a good contribution. The data processing pipeline, involving modification of the Coq compiler to extract internal type-theoretic information, is a non-trivial and valuable piece of engineering. The fine-tuning results (Tab 5) are highly valuable, showing a 32B parameter model achieving 48.6% s

Weaknesses

The following points are constructive feedback; I will increase my rating if the key experiments are conducted during the rebuttal. 1. The SOTA claim (45.8% vs 33.2%) rests on a comparison against one baseline (Graph2Tac). Would you be able to show via a pilot evaluation in LEAN, a direct comparison or against other provers mentioned in the related work, such as DeepSeek-Prover, Kimina-Prover, and Llemma. 2. The evaluation is on a 10% random sample of a benchmark (is it 1,300 or 1,386 theorems)

Reviewer 02Rating 4Confidence 2

Strengths

1. The central idea—enhancing reasoning by improving task clarity—is an appealing direction beyond model scaling and reinforcement learning. 2. The proposed Coq compiler interception pipeline that captures internal representations is a technically valuable contribution, potentially useful for future formal reasoning datasets. 3. This paper introduces a new metric to quantify how well a model “understands” a task.

Weaknesses

1. The “Clarity Score” uses the same model (DeepSeek-V3) as both generator and evaluator, introducing self-evaluation bias. No external or human validation is performed, so the metric may not measure conceptual understanding. 2. The comparison with Graph2Tac is not entirely fair, as Graph2Tac is a much smaller GNN-based model while this paper employs a large-scale LLM (DeepSeek-V3). Although the authors claim it is a general-purpose model not specifically trained on mathematical data, its scale

Reviewer 03Rating 4Confidence 3

Strengths

- The contribution of introducing conceptual clarity as a metric is interesting. - Empirical results are consistent. Authors show improved success rate on tasks over the baseline methods, as well as a light-weighted fine-tuned model achieving the highest success rate (matching the result on the non-fine-tuned general-purpose model).

Weaknesses

- The paper's methodology is difficult to follow. While the introduction emphasizes that one of the key contributions is enhancing task understanding in general-purpose models through enriched task descriptions, the corresponding methodological section (Section 5) lacks clarity. For instance, the presentation of the “structured semantic context” pipeline is difficult to follow, with important details scattered or deferred to the Appendix. A substantial reorganization and clearer exposition would

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.