Multilingual Mathematical Autoformalization
Albert Q. Jiang, Wenda Li, Mateja Jamnik

TL;DR
This paper introduces MMA, a large multilingual dataset for autoformalization, enabling language models to translate formal mathematical statements into informal language, significantly improving their performance on benchmarks.
Contribution
The creation of MMA, a novel multilingual, multi-domain dataset for autoformalization, and demonstrating its effectiveness in enhancing language model performance.
Findings
Fine-tuned models achieve 16-18% acceptable statements on benchmarks.
Base models had 0% acceptable statements without fine-tuning.
Multilingual training improves monolingual autoformalization capabilities.
Abstract
Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset consisting of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models. But these methods suffer from data scarcity and formal language acquisition difficulty. In this work, we create , a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on produce of statements acceptable with minimal corrections on…
Peer Reviews
Decision·Submitted to ICLR 2024
Strengths The Multilingual Mathematical Autoformalization (MMA) dataset was ingeniously synthesized using GPT-4, resulting in a substantial collection of 332,000 informal-formal pairings across multiple formal languages. With its multilingual and multidomain composition, the dataset notably exceeds the size of the largest existing datasets in the field. The research showcases improved outcomes when compared against established baselines.
Weaknesses The integrity of the MMA dataset warrants further examination, as it was entirely auto-generated via GPT-4, potentially leading to mismatched cases. The paper would benefit from an expanded discussion on the discrepancies and the inclusion of findings from human evaluation. The paper's innovative contribution seems incremental, largely relying on the automated capabilities of GPT-4 for dataset generation, which suggests a limited technical advancement.
* This work is the first effort in distilling mathematical formalization from the LLM, a contribution given the notable scarcity of parallel datasets of mathematical formalization. * Both the generated dataset and the fine-tuned model will be made publicly available for further research and development.
* The evaluation could benefit from further enhancement. Currently, a sample size of only 50 examples from the benchmark is used, which may not provide sufficiently convincing results due to potential statistical limitations. * As the author also mentioned, the synthesized datasets could be noisy. It would be beneficial to also include GPT4 in your comparative analysis for formalization quality. This would offer deeper insights into the noise levels within the generated datasets, i.e., how nois
This is an important problem, and the dataset will be helpful for many others working in this area. The dataset is the largest of its kind by far. In the introduction, the fact that the dataset covers multiple formal languages does not necessarily sound like a selling point to me, until you point out that training a model on multiple formal languages helps. I think this is a strong point that could be emphasized more (if I understand it correctly, see below under Questions).
I feel that the word "Multilingual" is confusing because it sounds like it works on multiple natural languages. However, I admit that "multi-formal-language" is awkward. The data is automatically generated by back-translation ("informalization"). This is good because informalization appears to be an easier task than formalization and because it's more important for the target (formal language) side of the data to be high quality. However, it also does mean that the source (natural language) sid
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
MethodsBalanced Selection
