Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Haiteng Zhao; Junhao Shen; Yiming Zhang; Songyang Gao; Kuikun Liu; Tianyou Ma; Fan Zheng; Dahua Lin; Wenwei Zhang; Kai Chen

arXiv:2512.10534 · cs.AI · March 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces InternGeometry, a large language model agent that uses iterative reasoning and complexity-boosted reinforcement learning to solve high-level geometry problems, surpassing expert models with minimal data.

Contribution

The work presents InternGeometry, a novel LLM agent that overcomes heuristic limitations in geometry problem solving through iterative proposals, symbolic verification, and complexity-based training.

Findings

01 Solves 44 of 50 IMO geometry problems, exceeding medalist scores.

02 Uses only 13K training examples, far fewer than previous models.

03 Proposes novel auxiliary constructions not seen in human solutions.

Abstract

Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred…
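The propose, verify, and reflect loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `Memory`, `solve`, `toy_verify`, `toy_propose`, and the `RULES` table are hypothetical stand-ins, not InternGeometry's components. The "symbolic engine" is a lookup of prerequisites, and the "LLM" is a hand-written policy that reads the last rejection from memory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Memory:
    """Bounded log of (proposal, feedback) pairs (hypothetical stand-in for dynamic memory)."""
    max_items: int = 8
    items: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, proposal: str, feedback: str) -> None:
        self.items.append((proposal, feedback))
        self.items = self.items[-self.max_items:]  # keep only the most recent entries


# Toy "geometry": each fact lists the facts that must already be verified (assumed data).
RULES = {"aux_point_M": set(), "AM=MB": {"aux_point_M"}, "goal": {"AM=MB"}}


def toy_verify(proposal: str, known: Set[str]) -> Tuple[bool, str]:
    """Stand-in for a symbolic engine: accept a proposal only if its prerequisites hold."""
    missing = RULES[proposal] - known
    if missing:
        return False, "missing:" + ",".join(sorted(missing))
    return True, "verified"


def toy_propose(goal: str, known: Set[str], memory: Memory) -> str:
    """Stand-in for the LLM: reflect on the latest unresolved rejection in memory."""
    for proposal, feedback in reversed(memory.items):
        if feedback.startswith("missing:") and proposal not in known:
            need = feedback.split(":", 1)[1].split(",")[0]
            # Propose the missing prerequisite, or retry once it has been verified.
            return need if need not in known else proposal
    return goal  # nothing to reflect on yet: attempt the goal directly


def solve(goal: str, propose: Callable, verify: Callable, max_steps: int = 200):
    """Iterate propose -> verify -> record feedback until the goal is verified."""
    memory, known = Memory(), set()
    for step in range(1, max_steps + 1):
        proposal = propose(goal, known, memory)
        ok, feedback = verify(proposal, known)
        memory.add(proposal, feedback)
        if ok:
            known.add(proposal)
            if proposal == goal:
                return True, step
    return False, max_steps
```

With these toy rules, `solve("goal", toy_propose, toy_verify)` is rejected twice, introduces the auxiliary point, and then verifies the remaining facts in order. The step budget `max_steps` plays the role of the interaction limit; the default of 200 is chosen only to echo the scale mentioned in the abstract.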

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

1. Proposes the first LLM agent for IMO-level geometry proving, avoiding the use of specialist models.
2. Proposes a dynamic memory mechanism and a rejection sampling strategy, enabling interactive reasoning of up to 200 steps and guiding diverse exploration during interaction.
3. Solid experiments justify the claim that InternGeometry outperforms current SOTA models, with exceptional data efficiency. Comprehensive ablation studies validate the necessity of key components like long-range interactions, CBRL, an…

Weaknesses

1. The title is not clear. Since the manuscript focuses on developing a plane geometry prover, the title should say so.
2. Equation 5 extends beyond the page margin.
3. Experiments on more datasets (e.g., JGEX-AG-231, proposed in AlphaGeometry) and with other LLMs (besides InternThinker) would further demonstrate the generalization ability of InternGeometry.
4. Considering that the interactive reasoning requires many steps, analyzing and comparing the computational resources of Intern…

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. InternGeometry is the new SOTA on IMO 50.
2. The training set of geometry problems is much smaller than in previous approaches, which is impressive.
3. The analysis of scaling the maximum number of steps versus scaling the number of samples is insightful.

Weaknesses

1. The paper would benefit from more discussion of the inference-time costs of the compared methods. For example, listing the model sizes in Table 1 would be helpful, as would explaining the different search parameters in AlphaGeometry2's custom beam search. If available, information about the total number of output tokens or wall-clock time would also be appreciated.
2. Similarly, it would be nice if the paper also discussed training costs and compared them with previous methods. C…

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. Introducing a new tool call that leverages the strong planning and reasoning of LLMs for dynamic interaction with the formal deductive engine is a novel and exciting idea. The paper does a good job of explaining and motivating this tool call, which makes the high-level approach clear and sound.
2. As mentioned in the summary, much trial and error is typically expected in figuring out which auxiliary constructions help solve a given problem, and this fact would require the pro…

Weaknesses

1. While the RL reward and RL loss are clearly defined, the explanation of the curriculum algorithm lacks clarity, which makes it hard to evaluate the soundness of the proposed curriculum approach. The paper touches on the theory behind the curriculum algorithm only superficially; the statements of Theorems 1 and 2 are vague and hard to follow and could be better explained. More importantly, the paper does not explain the CBRL algorithm in sufficient detail and particularl…

Taxonomy

Topics: Machine Learning in Materials Science · Polynomial and algebraic computation · Mathematics Education and Teaching Techniques