MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu, Mengdi Zhang, Jian Shao, Yueting Zhuang

TL;DR
MathFimer introduces a fill-in-the-middle inspired framework for expanding intermediate reasoning steps in mathematical solutions, significantly improving large language models' reasoning performance without high computational costs.
Contribution
The paper presents MathFimer, a novel method for step expansion in mathematical reasoning that enhances model training data and performance without external models or heavy computation.
Findings
Models trained on MathFimer-expanded data outperform counterparts trained on the original data.
MathFimer improves reasoning accuracy on benchmarks like GSM8K and MATH.
The approach is scalable and does not require external models.
Abstract
Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models. Recent studies have demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-middle" task from code reasoning. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated…
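To make the prefix-suffix decomposition described in the abstract concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the names `FimExample`, `make_fim_examples`, `expand_once`, and the `fill` callback are hypothetical, and the choice to use the question plus preceding steps as the prefix and the single following step as the suffix is an assumption, since the abstract does not specify the exact format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FimExample:
    prefix: str  # question plus the steps before the gap (assumed layout)
    suffix: str  # the step right after the gap (assumed layout)
    middle: str  # the held-out step the model should reconstruct


def make_fim_examples(question: str, steps: List[str]) -> List[FimExample]:
    """Hold out each interior step in turn and build one training example per gap."""
    examples = []
    for i in range(1, len(steps) - 1):
        prefix = "\n".join([question] + steps[:i])
        examples.append(
            FimExample(prefix=prefix, suffix=steps[i + 1], middle=steps[i])
        )
    return examples


def expand_once(
    question: str,
    steps: List[str],
    fill: Callable[[str, str], str],
) -> List[str]:
    """Ask a fill-in model for one new step between every pair of adjacent steps.

    `fill(prefix, suffix)` is a hypothetical stand-in for a trained
    fill-in-the-middle model; an empty return value means "nothing to insert".
    """
    expanded = []
    for i, step in enumerate(steps):
        expanded.append(step)
        if i + 1 < len(steps):
            prefix = "\n".join([question] + steps[: i + 1])
            proposal = fill(prefix, steps[i + 1]).strip()
            if proposal:
                expanded.append(proposal)
    return expanded
```

Calling `expand_once` repeatedly on its own output would correspond to the iterative expansion the reviewers mention. Note that the original steps are kept unchanged and only new steps are inserted, so any error in the base solution carries through.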
Peer Reviews
Decision: ICLR 2026 Poster
1. The use of FIM for step expansion in mathematical reasoning is innovative and offers a scalable alternative to expensive methods like MCTS or distillation from larger models. 2. Extensive experiments across multiple datasets and model sizes show consistent improvements (e.g., +7.43% on GSM8K, +8.86% on MATH), demonstrating the method’s effectiveness. 3. The approach works well even with smaller models (e.g., 1.5B), and supports iterative expansion, making it computationally efficient and wide
1. The method is only evaluated on mathematical reasoning. Its applicability to other domains (e.g., logic, code, commonsense reasoning) remains unclear. 2. The expanded steps are generated by a model and not verified for correctness or logical consistency, which may introduce errors, especially after multiple iterations. 3. The effectiveness of MathFimer heavily relies on the quality of the initial CoT data. Poor-quality base solutions could limit or mislead the expansion process.
- The paper introduces a creative extension of the FIM objective, which was previously used in code completion, to mathematical reasoning. This adaptation is conceptually elegant and non-trivial because reasoning chains differ structurally from code. - The idea of training a FIM model, MathFimer, to insert plausible intermediate steps into existing verified solutions is also elegant and computationally efficient. - The experiments are extensive, covering multiple datasets (GSM8K, MATH,
- While the paper demonstrates strong results in mathematical reasoning, the scope is narrowly confined to math. That FIM-based reasoning expansion could be a general mechanism for improving structured reasoning is not empirically validated beyond this domain. - In Table 1, performance drops with MathFimer in some cases, and this is not discussed. - The evaluation relies heavily on LLM-as-a-judge for correctness and PRM (process reward models) for reasoning quality. While these are reaso
* The use of the Fill-in-the-Middle paradigm for explicitly expanding reasoning steps is a creative and new application. * The experiments show consistent and significant performance improvements across various base models and diverse benchmarks.
* The expansion process is contingent on the correctness and structure of the initial CoT steps. If the original CoT contains logical errors or fundamentally flawed reasoning, the FIM process might only elaborate on the mistake, potentially creating overly confident but incorrect training data. * While detailed steps are generally beneficial, the FIM approach risks generating reasoning chains that are unnecessarily verbose or contain redundant intermediate steps, which could increase inference
Taxonomy
Topics: Educational Games and Gamification · Intelligent Tutoring Systems and Adaptive Learning
