Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization
Willy Chan, Michael Souliman, Jakob Nordhagen, Brando Miranda, Elyas, Obbad, Kai Fronsdal Sanmi Koyejo

TL;DR
This paper demonstrates that high-quality, carefully generated synthetic data via back-translation and prompt engineering significantly improves autoformalization performance of language models, outperforming large multilingual datasets with fewer resources.
Contribution
Introduces novel back-translation and prompt strategies that prioritize data quality over quantity, enhancing autoformalization with minimal data and resources.
Findings
Quality-focused synthetic data improves model performance.
Our methods outperform fine-tuning with extensive multilingual datasets.
Significant reduction in resource requirements for formalizing proofs.
Abstract
Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Data Mining Algorithms and Applications
