MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Syeda Nahida Akter; Shrimai Prabhumoye; John Kamalu; Sanjeev Satheesh; Eric Nyberg; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

arXiv:2410.12881·cs.AI·April 28, 2025

TL;DR

This paper introduces MIND, a novel method for generating math-informed synthetic dialogues to enhance the mathematical reasoning capabilities of large language models, leading to significant performance improvements across multiple benchmarks.

Contribution

The paper presents a large-scale, diverse synthetic dialogue generation approach that incorporates knowledge gaps and optimized data integration for improved LLM mathematical reasoning.

Findings

1. Pretraining on MIND-OWM significantly improves mathematical reasoning scores.
2. Incorporating knowledge gaps enhances the quality of synthetic math data.
3. Restructuring raw data during pretraining maximizes reasoning gains.

Abstract

The utility of synthetic data to enhance pretraining data quality and hence to improve downstream task accuracy has been widely explored in recent large language models (LLMs). Yet, these approaches fall inadequate in complex, multi-hop and mathematical reasoning tasks as the synthetic data typically fails to add complementary knowledge to the existing raw corpus. In this work, we propose a novel large-scale and diverse Math Informed syNthetic Dialogue (MIND) generation method that improves the mathematical reasoning ability of LLMs. Specifically, using MIND, we generate synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM. Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data. We further identify an effective way to format and…
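As a rough illustration of the idea in the abstract (turning raw math text into conversations between participants with a knowledge gap), the following minimal sketch splits a document into chunks and wraps each chunk in a dialogue-generation prompt. It is not the authors' implementation: the speaker roles, the prompt wording, the word-based chunking, and all names (`DialogueStyle`, `KNOWLEDGE_GAP_STYLE`, `chunk_words`, `build_prompt`, `build_prompts`) are assumptions made for illustration, and the call to the generating LLM is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DialogueStyle:
    """A hypothetical conversational setting (not taken from the paper)."""
    name: str
    speaker_a: str
    speaker_b: str
    instruction: str


# Assumed style: one participant knows the material, the other does not.
KNOWLEDGE_GAP_STYLE = DialogueStyle(
    name="knowledge_gap",
    speaker_a="Expert",
    speaker_b="Learner",
    instruction=(
        "Rewrite the text as a conversation in which the Learner asks "
        "questions and the Expert explains each reasoning step."
    ),
)


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_prompt(chunk: str, style: DialogueStyle) -> str:
    """Wrap one raw-text chunk in a dialogue-generation prompt."""
    return (
        f"{style.instruction}\n"
        f"Speakers: {style.speaker_a}, {style.speaker_b}\n\n"
        f"Text:\n{chunk}\n\n"
        "Conversation:\n"
    )


def build_prompts(
    document: str, style: DialogueStyle, max_words: int = 500
) -> List[str]:
    """Produce one prompt per chunk; each would be sent to a generator LLM."""
    return [build_prompt(c, style) for c in chunk_words(document, max_words)]
```

Each prompt returned by `build_prompts` would then be passed to whatever instruction-tuned model generates the dialogue; the chunk size of 500 words is an arbitrary default chosen here, not a value from the paper.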

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper presents its contributions, methodology, and results clearly, and it is pleasant to read.
2. Although the MIND framework is simple and straightforward, strong experimental results demonstrate the effectiveness of both the framework and the data it synthesizes. The results are evaluated on three different math corpora, and all show substantial improvements from training on the generated data.

Weaknesses

One very relevant work should be discussed and compared against: Dialog Inpainting: Turning Documents into Dialogs (https://proceedings.mlr.press/v162/dai22a.html). That paper also proposes ways to synthesize conversations from knowledge sources. Although it is not specific to math, the authors should at least discuss the methodological differences in the rebuttal.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

Originality: The paper presents an approach (MIND) to generate high-quality synthetic math dialogues that improve math reasoning in LMs. Compared to prior work on synthetic pretraining data that mostly rephrases raw text, MIND adds semantic variations and step-by-step reasoning that are crucial for complex math problem solving. Quality: The paper is thorough in evaluating MIND across multiple dimensions - testing different conversational styles, scaling behavior, and applicability to varying se

Weaknesses

- Since the method uses the LLAMA3-70B-INSTRUCT model to generate conversations, it is unclear whether the improvements in downstream reasoning tasks come from the quality of the generated dialogues or are simply a result of model distillation from the powerful LLAMA3-70B model. The authors should investigate this and isolate the impact of the MIND-generated dialogues from the influence of the underlying LLM.
- The experiments are based on a single in-house pretrained model checkpoint. It is po

Reviewer 03 · Rating 5 · Confidence 4

Strengths

Experiments on two pretraining corpora demonstrate the helpfulness of the proposed method on reasoning tasks. This method can also be used to clean pre-training corpora.

Weaknesses

I have several concerns and questions regarding the motivation and details.
- The quality of the synthetic math data: The authors state that heuristics are applied to filter the generated conversations, and they report the similarity between the raw text and the synthetic dialogues. How much do the original text and the corresponding conversation differ in length on average? The degree of hallucination in the generated text is also unclear.
- The impact of knowledge disti

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Semantic Web and Ontologies