MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Jonathan Drechsel, Anja Reusch, Steffen Herbold

TL;DR
This paper introduces MAMUT, a framework for generating diverse mathematical formulas in LaTeX to create specialized datasets that improve language models' understanding of mathematical notation.
Contribution
MAMUT is a novel framework that produces equivalent and falsified formulas, enabling the creation of diverse datasets for training models on mathematical content.
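As a rough illustration of what "equivalent" and "falsified" versions of a LaTeX formula can look like, the sketch below applies three toy string-level transformations. It is not MAMUT's implementation or API: the function names (`rename_variables`, `swap_sides`, `falsify_constant`) are hypothetical, and the assumption that incrementing a constant breaks an identity holds for this example but not for every formula.

```python
import re


def rename_variables(latex: str, mapping: dict[str, str]) -> str:
    """Equivalent variant: consistently rename single-letter variables.

    LaTeX commands such as \\frac are matched as whole tokens and left alone.
    """
    def repl(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("\\"):
            return token  # a command like \frac, keep unchanged
        return mapping.get(token, token)

    return re.sub(r"\\[A-Za-z]+|[A-Za-z]", repl, latex)


def swap_sides(latex: str) -> str:
    """Equivalent variant: use the symmetry of '=' to swap both sides."""
    lhs, rhs = latex.split("=", 1)
    return f"{rhs.strip()} = {lhs.strip()}"


def falsify_constant(latex: str) -> str:
    """Falsified variant (toy assumption): increment the first integer constant."""
    return re.sub(r"\d+", lambda m: str(int(m.group(0)) + 1), latex, count=1)


formula = r"(a+b)^2 = a^2 + 2ab + b^2"

print(rename_variables(formula, {"a": "x", "b": "y"}))
# (x+y)^2 = x^2 + 2xy + y^2
print(swap_sides(formula))
# a^2 + 2ab + b^2 = (a+b)^2
print(falsify_constant(formula))
# (a+b)^3 = a^2 + 2ab + b^2
```

The first two outputs state the same identity in different notation, while the third looks similar but no longer holds in general. Pairs of this kind are what a dataset of equivalent and falsified formulas would contain.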
Findings
Models trained on MAMUT datasets achieve new state-of-the-art performance in mathematical retrieval.
The generated datasets contain diverse notational representations of the same mathematical concepts.
The framework effectively captures the variety in mathematical notation for improved model training.
Abstract
Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the variety of notations used for the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets…
