Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving

Tianyun Yang; Yunwen Li; Ziniu Li; Zhihang Lin; Ruoyu Sun; Tian Ding

arXiv:2508.09099·cs.LG·August 13, 2025

Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving

Tianyun Yang, Yunwen Li, Ziniu Li, Zhihang Lin, Ruoyu Sun, Tian Ding

PDF

Open Access 4 Reviews

TL;DR

This paper introduces GF-Reasoner, a model that combines Chain-of-Thought reasoning with formal language to improve geometry problem solving, achieving significant accuracy gains and more interpretable reasoning traces.

Contribution

It presents a novel hybrid reasoning approach integrating natural language and formal code, trained with synthetic data and reinforcement learning, enhancing geometry problem-solving performance.

Findings

01

Achieves up to 15% accuracy improvement on GPS benchmarks.

02

Produces shorter, clearer reasoning traces with high-order geometric knowledge.

03

Outperforms larger models and previous methods in geometry problem solving.

Abstract

Large vision language models exhibit notable limitations on Geometry Problem Solving (GPS) because of their unreliable diagram interpretation and pure natural-language reasoning. A recent line of work mitigates this by using symbolic solvers: the model directly generates a formal program that a geometry solver can execute. However, this direct program generation lacks intermediate reasoning, making the decision process opaque and prone to errors. In this work, we explore a new approach that integrates Chain-of-Thought (CoT) with formal language. The model interleaves natural language reasoning with incremental emission of solver-executable code, producing a hybrid reasoning trace in which critical derivations are expressed in formal language. To teach this behavior at scale, we combine (1) supervised fine-tuning on an 11K newly developed synthetic dataset with interleaved natural…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01Rating 6Confidence 3

Strengths

1. Text, figures, tables are all good presented. 2. Experiments with various types of baselines and datasets well justify the performance of the proposed framework compared with existing methods.

Weaknesses

1. Appendix B Related Work is not cited in the main text, which is an important part to understand the contribution. 2. The so-called "interleaved formal-natural language CoT" is suspicious. Figure 1 clearly shows that the interleaved CoT involves interleaved natural-language reasoning statement and formal reasoning statement. However, in Appendix D, the so-called interleave CoT is more like a three-stage reasoning process, i.e., step-by-step natural-language reasoning, generating formal stateme

Reviewer 02Rating 4Confidence 2

Strengths

- **Clear motivation:** The paper addresses geometry problem solving, which is difficult for the current LVLMS. - **Rigorous ablations:** The study isolates the effects of the added elements such as the interleaved CoT, synthesis strategies, and RL. This helps the readers understand the contribution of each elements to the performance gain.

Weaknesses

- **Questionable necessity of the proposed approach** - The paper does not demonstrate that the hybrid approach is fundamentally better than just improving pure natural language reasoning (SFT + RL only on natural language reasoning). - To add, Interleaving CoT is not a novel idea. This is similar to how the ReAct framework works. - **The performance comparisons are not particularly fair** - The proposed model has access to formal language specs, while the other baseline models do no

Reviewer 03Rating 4Confidence 3

Strengths

1. This is a nice and well-written paper that tackles an interesting problem that LVLMs face. 2. The hybrid reasoning approach presented in the paper is novel where it combines interpretability and precision of the two core components, i.e., natural language CoT and formal reasoning. 3. Training with SFT and RL shows robust framework and performance boost. Their 7B model outperforms the peers as well as larger model of 72B. 4. The paper contributed to the field by curating a synthetic dataset

Weaknesses

1. One major concern that I have is that there are only 34 operators which means if a problem needs a geometry theorem, it is not included in the set. 2. Dataset was essentially distilled from large and strong teacher models, hence model performance is heavily dependent on these models. 3. RM is quite simple (binary) and the model doesn't take partial credit for correct steps rather than final answer.

Reviewer 04Rating 4Confidence 4

Strengths

1. The method achieves excellent performance and efficiency. The GF-Reasoner (7B) significantly outperforms the 10x larger Qwen2.5-VL-72B by +15.2% (PGPS9K) and +4.8% (UniGEO). This hybrid reasoning approach also reduces the "computation error" rate to nearly zero (0.3%) and substantially decreases the "reasoning error" rate from 23.0% to 14.3%. 2. The paper validates an effective "SFT + solver-in-the-loop RL" framework, with the RL stage alone improving Pass@1 accuracy by +26.6%. The detailed a

Weaknesses

1. The core method (CoT + external solver) is a well-established paradigm (e.g., AlphaGeometry, Toolformer, PoT, Logic-RL, ReTool, ToRL, Search-O1). The reliance on an existing solver (PGPSNet's 34 operations) shifts the contribution to data synthesis and an SFT+RL pipeline, rather than methodological innovation. 2. The framework is highly specialized, confined to a closed set of 34 predefined solver operations. This limits scalability (e.g., cannot "add auxiliary lines" per Section F) and lacks

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsConstraint Satisfaction and Optimization · Multimodal Machine Learning Applications · AI-based Problem Solving and Planning