AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

Lixuan He; Jie Feng; Yong Li

arXiv:2508.06944 · cs.LG · October 13, 2025

Open Access

TL;DR

AMFT introduces a meta-learning approach to dynamically balance imitation and exploration in LLM fine-tuning, leading to improved reasoning performance and better generalization across diverse benchmarks.

Contribution

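This paper presents AMFT, a single-stage meta-learning algorithm that learns the balance between SFT's implicit reward and RL's explicit reward for LLM alignment, addressing the catastrophic forgetting and suboptimal imitation-exploration trade-offs of the two-stage SFT-then-RL pipeline.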

Findings

01. Achieves new state-of-the-art results on reasoning benchmarks.

02. Demonstrates superior out-of-distribution generalization.

03. Shows stability and efficiency through meta-gradient control.

Abstract

Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that…
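
As a rough illustration of what balancing two training signals with a single weight can look like, the sketch below blends an imitation (SFT) loss and an exploration (RL) loss. The function name `blended_loss`, the weight `mu`, and the convex-combination form are all assumptions made for illustration. The abstract as shown here does not give AMFT's actual objective or the update rule of its weight controller, so this is not the authors' method.

```python
def blended_loss(sft_loss: float, rl_loss: float, mu: float) -> float:
    """Convex combination of an SFT loss and an RL loss.

    `mu` is a hypothetical balance weight in [0, 1] (an assumption, not
    taken from the paper): mu = 1 is pure imitation, mu = 0 is pure
    exploration. How such a weight would be adapted during training is
    deliberately left out, since the source text does not describe it.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * sft_loss + (1.0 - mu) * rl_loss


if __name__ == "__main__":
    # Sweep the weight from pure imitation to pure exploration.
    for mu in (1.0, 0.5, 0.0):
        print(f"mu={mu}: loss={blended_loss(2.0, 4.0, mu)}")
```

A fixed `mu` corresponds to a hand-tuned heuristic. A learned controller, as the abstract describes at a high level, would instead adjust the weight during training.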

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling