AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei

TL;DR
AgenticMath introduces an innovative multi-stage agentic approach to generate high-quality mathematical question-answer pairs, significantly improving LLM reasoning performance with less data compared to traditional large-scale datasets.
Contribution
The paper presents a novel agentic data generation method for creating high-quality math datasets, enhancing LLM reasoning with fewer samples than existing large datasets.
Findings
Fine-tuning LLMs on AgenticMath data yields superior reasoning performance.
High-quality, targeted data can outperform larger, less curated datasets.
The method reduces data requirements while maintaining or improving accuracy.
Abstract
The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic method for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final…
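As an illustration of stage (1), the seed filter can be pictured as a threshold over per-question quality scores. The sketch below is a minimal one under stated assumptions, not the authors' implementation: the `ScoredQuestion` type, the 1-5 score scale, and the rule that every score must reach the threshold are hypothetical. The default threshold of 3 is the value reviewers mention for the seed filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredQuestion:
    """A seed question with hypothetical quality scores (assumed 1-5 scale)."""
    text: str
    richness: float    # information richness
    complexity: float
    clarity: float


def filter_seeds(questions: List[ScoredQuestion], tau: float = 3.0) -> List[ScoredQuestion]:
    """Keep questions whose lowest score is at least tau.

    Using the minimum of the three scores is an assumption made for this
    sketch; the paper may aggregate the scores differently.
    """
    return [
        q for q in questions
        if min(q.richness, q.complexity, q.clarity) >= tau
    ]
```

Taking the minimum is the strictest simple choice: a question that is rich and complex but unclear is dropped. An average or weighted sum would be an equally plausible reading of the abstract.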
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The method is simple and results in very sample efficient synthetic data
* The use of DS$^2$ score curation method in the context of synthetic data generation is novel and interesting
* I liked the ablation studies included in Section 4.3
* Clarity of writing in Section 3.5 could be improved (specifically in Lines 288-291)
* My biggest concern is that comparisons with MetaMath, Dart-MATH etc. are not fair - because the question generation and more importantly, the solution generation models aren't the same. As evident from the results, most of the gains come from solution augmentation. MetaMath and DART-Math both use weaker teacher models to sample solutions. A fairer comparison would be using GPT-4o-mini as the generation model
- AgenticMath at 30K–60K samples consistently matches or beats baselines across Qwen2.5-3B, DeepSeekMath-7B, Mistral-7B, and Llama3-8B, trained on hundreds of thousands to millions of examples, demonstrating high efficiency.
- Comprehensive analysis. The paper quantifies incremental gains from each stage: solution augmentation delivers the biggest single jump, while filtering, rephrasing, and review/revise provide additive improvements.
- The seed filter threshold score τ=3 seems not to be optimal according to Table 4. Other key hyperparameters (review threshold τrev=4.5, max three review–revise iterations) lack ablation studies.
- Both the problem proposer and evaluator are GPT-4o-mini, which may create self-judging biases.
- The appendix lacks substantial details. It should include more design details (e.g., prompt tuning), ablations on threshold values, and ablations on the implementation of “long-tail” diversity selector (
1. This work proposed a clear agentic pipeline to construct high-quality reasoning corpora.
2. Fine-tuning on the AgenticMathQA dataset outperforms baselines trained on larger datasets.
1. The four-stage pipeline, which involves filtering, synthesizing, refining mathematical problems and solutions, and evaluation, appears rather conventional and straightforward, lacking clear novelty.
2. The overall quality of the dataset is determined by the scores assigned at each stage of the pipeline. However, since the score-based filtering relies entirely on human-designed priors, the process is overly heuristic and lacks robustness.
3. The framework is essentially a form of data distilla
+ The authors were able to successfully generate training data related to math tasks for the model's instruction fine-tuning process.
+ The model was trained on synthetic data and achieved improvements.
+ This method of data synthesis is highly similar to previous works such as JiuZhang 3.0, Dart-Math, ScaleQuest, and MAmmoTH2.0, and it does not show significant improvement in training results. The contribution of this paper is very limited.
+ This paper uses the concept of an Agent to package the method, but in reality, it does not utilize Agent-related features such as memory. The method has little to do with Agents.
+ The evaluation is unreliable. Both the in-domain and out-of-domain tasks
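The reviews above mention a review threshold of τrev=4.5 and a cap of three review–revise iterations. A bounded loop of that shape might look like the sketch below. It is a minimal illustration under stated assumptions, not the paper's code: the `review` and `revise` callables are hypothetical stand-ins for LLM-backed agents, and the final re-check after the last revision is a design choice of this sketch.

```python
from typing import Callable, Tuple

# Hypothetical signatures: a reviewer returns (score, feedback),
# a reviser returns a new version of the item given that feedback.
Reviewer = Callable[[str], Tuple[float, str]]
Reviser = Callable[[str, str], str]


def review_and_revise(
    item: str,
    review: Reviewer,
    revise: Reviser,
    tau_rev: float = 4.5,   # review threshold cited in the reviews
    max_iters: int = 3,     # iteration cap cited in the reviews
) -> Tuple[str, bool]:
    """Revise an item until its review score reaches tau_rev or the cap is hit.

    Returns the final item and whether it was accepted.
    """
    current = item
    for _ in range(max_iters):
        score, feedback = review(current)
        if score >= tau_rev:
            return current, True
        current = revise(current, feedback)
    # Assumption of this sketch: the last revision gets one final review.
    score, _ = review(current)
    return current, score >= tau_rev
```

Items that still score below the threshold after the cap are returned with `accepted=False`, so a caller can discard them rather than keep a low-scoring sample.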
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
