AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Lixuan He, Jie Feng, Yong Li

TL;DR
AMFT introduces a meta-learning approach to dynamically balance imitation and exploration in LLM fine-tuning, leading to improved reasoning performance and better generalization across diverse benchmarks.
Contribution
This paper presents AMFT, a single-stage meta-learning algorithm that learns the balance between the SFT and RL reward signals for LLM alignment, addressing the catastrophic forgetting and suboptimal imitation-exploration trade-offs of two-stage SFT-then-RL pipelines.
Findings
Achieves new state-of-the-art results on reasoning benchmarks.
Demonstrates superior out-of-distribution generalization.
Shows stability and efficiency through meta-gradient control.
Abstract
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that…
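The idea of weighting an imitation signal against an exploration signal can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the convex-combination form of the objective, the clipped gradient step on the weight, and the names `combined_loss`, `update_weight`, `mu`, and `meta_grad` are all hypothetical.

```python
def combined_loss(sft_loss: float, rl_loss: float, mu: float) -> float:
    """Blend an SFT loss and an RL loss with a single weight mu in [0, 1].

    Assumption: a convex combination; the paper's actual objective may differ.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * sft_loss + (1.0 - mu) * rl_loss


def update_weight(mu: float, meta_grad: float, lr: float = 0.1,
                  lo: float = 0.0, hi: float = 1.0) -> float:
    """Take one gradient-descent step on mu and clip it to [lo, hi].

    Assumption: meta_grad stands in for the gradient of some held-out
    objective with respect to mu; how it is computed is not shown here.
    """
    return min(hi, max(lo, mu - lr * meta_grad))


if __name__ == "__main__":
    mu = 0.5
    print(combined_loss(2.0, 4.0, mu))    # blended loss at the current weight
    mu = update_weight(mu, meta_grad=1.0)  # a positive gradient lowers mu
    print(mu)
```

In this sketch a larger `mu` puts more weight on imitation and a smaller `mu` more on exploration; the clipping keeps the weight in a valid range regardless of the step size.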
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
