ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao

TL;DR
ReForm introduces a reflective autoformalization approach with iterative self-correction and semantic validation, significantly improving the accuracy of translating natural language mathematics into formal statements.
Contribution
The paper presents ReForm, a novel reflective autoformalization method with Prospective Bounded Sequence Optimization for better semantic fidelity.
Findings
ReForm outperforms baselines by 22.6 percentage points on four benchmarks.
Introduces ConsistencyCheck, a benchmark with 859 expert-annotated items.
Human experts still make semantic errors in up to 38.5% of cases.
Abstract
Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive…
Peer Reviews
Decision·ICLR 2026 Poster
1. **Novel Reflective Paradigm:** The core innovation is shifting autoformalization from a one-pass translation task to an iterative, self-correcting process. By mimicking the human expert's cycle of generation, validation, and refinement, ReForm directly addresses the critical challenge of semantic fidelity, moving beyond mere syntactic correctness. 2. **Effective Training with PBSO:** The proposed Prospective Bounded Sequence Optimization (PBSO) algorithm is a clever solution to the multi-o
1. **Heacy Dependence on LLM-based Evaluation**: The entire training and evaluation framework relies heavily on LLM judges (like Qwen3-235B and CriticLean-14B) to assess semantic consistency. While the authors rigorously validate these judges with their ConsistencyCheck benchmark, they still have a non-trivial error rate (about 17%). 2. **Insufficient Ablation on Key Algorithmic Components**: While the paper demonstrates the overall effectiveness of PBSO, it lacks ablation studies on several cr
- The target problem is well-motivated and addresses a clear bottleneck in formal reasoning: the semantic fidelity in autoformalization. - The integration of reflective reasoning and reinforcement learning is novel and effective; the experiments demonstrate consistent and interpretable improvements across multiple benchmarks. - The newly established ConsistencyCheck benchmark provides a valuable resource for quantitatively assessing the reliability of LLM-based metrics and for understanding the
- The paper relies heavily on LLM-based semantic evaluation metrics, which, despite the ConsistencyCheck benchmark, may still introduce bias or circularity in measuring semantic consistency. - The computational cost and efficiency trade-offs of the reflective multi-iteration process are not fully analyzed — it remains unclear how scalable the approach is for large-scale or more complex formal systems. - Several important related works are missing; please refer to the recent surveys [1, 2] for
- The paper provides a **ConsistencyCheck benchmark** that rigorously evaluates the reliability of **LLM-based judges** and quantifies the challenges of **autoformalization**. - Introduces **iterative reflection techniques from reasoning models** into the process of **autoformalization**. - The paper is **well-written**.
- LLM is not a perfect supervisory signal, and I would like to know to what extent the capability of the judge model affects the stability of training. - How did you handle **semantic alignment** in your method — was it **integrated into the reward design**? - Please provide an **ablation study comparing process-level supervision and outcome-level supervision**.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
