Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization
William Feng, Ethan Lou, Aryan Sharma

TL;DR
This paper investigates how paraphrasing math problem statements affects the correctness of autoformalization in Lean 4, revealing that current models often fail at code generation and are sensitive to input variations.
Contribution
It provides a detailed analysis of paraphrase-induced failures in Lean 4 autoformalization, introducing a taxonomy of failure modes and highlighting the need for targeted training interventions.
Findings
Paraphrase sensitivity mainly occurs at the code-generation layer.
Failures are categorized by dataset and model type.
State-of-the-art models still struggle with generating valid Lean code.
Abstract
Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
