Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual   Data in AutoFormalization

Willy Chan; Michael Souliman; Jakob Nordhagen; Brando Miranda; Elyas; Obbad; Kai Fronsdal Sanmi Koyejo

arXiv:2502.15795·cs.AI·February 25, 2025

Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization

Willy Chan, Michael Souliman, Jakob Nordhagen, Brando Miranda, Elyas, Obbad, Kai Fronsdal Sanmi Koyejo

PDF

Open Access

TL;DR

This paper demonstrates that high-quality, carefully generated synthetic data via back-translation and prompt engineering significantly improves autoformalization performance of language models, outperforming large multilingual datasets with fewer resources.

Contribution

Introduces novel back-translation and prompt strategies that prioritize data quality over quantity, enhancing autoformalization with minimal data and resources.

Findings

01

Quality-focused synthetic data improves model performance.

02

Our methods outperform fine-tuning with extensive multilingual datasets.

03

Significant reduction in resource requirements for formalizing proofs.

Abstract

Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNatural Language Processing Techniques · Data Mining Algorithms and Applications