Synthesis by Design: Controlled Data Generation via Structural Guidance
Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

TL;DR
This paper introduces a method for controlled data generation for mathematical reasoning in large language models by extracting structural information from generated solutions, resulting in higher-quality datasets and improved reasoning performance.
Contribution
We propose a novel approach that uses structural guidance from generated solutions to improve data synthesis for mathematical reasoning tasks.
Findings
Generated 39K problems with labeled intermediate steps.
Created a 6.1K-problem, higher-difficulty benchmark.
Fine-tuning on our dataset improves LLM reasoning performance.
Abstract
Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose extracting structural information from mathematical reasoning via generated problem-solving code, and using the resulting structured solutions to guide data generation. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will…
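To make the idea of "structural information from problem-solving code" concrete, here is a minimal sketch, not the authors' pipeline. It assumes a solution is written as straight-line Python in which each top-level assignment is one intermediate step. The helper names (`extract_steps`, `evaluate_steps`) and the toy problem are hypothetical and introduced only for illustration. The sketch recovers each step's expression and which earlier steps it depends on, then evaluates the steps to obtain labeled intermediate values. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast

# Hypothetical toy solution: each assignment is treated as one reasoning step.
SOLUTION = """
apples = 5
price = 3
cost = apples * price
change = 20 - cost
"""

def extract_steps(solution_code):
    """Return one record per top-level single-name assignment."""
    tree = ast.parse(solution_code)
    steps = []
    defined = set()
    for node in tree.body:
        is_simple_assign = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if not is_simple_assign:
            continue  # this sketch ignores loops, calls as statements, etc.
        target = node.targets[0].id
        # Names read on the right-hand side that earlier steps defined.
        deps = sorted(
            n.id
            for n in ast.walk(node.value)
            if isinstance(n, ast.Name) and n.id in defined
        )
        steps.append(
            {"target": target, "expr": ast.unparse(node.value), "deps": deps}
        )
        defined.add(target)
    return steps

def evaluate_steps(steps):
    """Evaluate steps in order and return every intermediate value."""
    env = {}
    for step in steps:
        # Builtins are disabled; only earlier step values are visible.
        env[step["target"]] = eval(step["expr"], {"__builtins__": {}}, env)
    return env

steps = extract_steps(SOLUTION)
values = evaluate_steps(steps)
for step in steps:
    print(step["target"], "=", step["expr"], "| depends on", step["deps"],
          "| value", values[step["target"]])
print("reasoning length:", len(steps))
```

In this sketch, the number of extracted steps serves as a simple proxy for reasoning length, and the dependency lists form a graph that a generator could, in principle, extend or perturb to produce harder variants. How the paper actually defines and manipulates this structure is not specified in the summary above.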
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Mathematics Education and Teaching Techniques · History and Theory of Mathematics
