Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
Han Zheng, Yining Ma, Karthick Gunasekaran, Bharathan Balaji, Zheng Du, Shiv Vitaladevuni, Cathy Wu

TL;DR
METIS is a curriculum framework for LLM reinforcement fine-tuning in which the policy itself predicts how informative each prompt is, measured by within-prompt reward variance, and uses that prediction to decide where training effort goes.
Contribution
METIS internalizes curriculum judgment as a native capability, using reward variance to guide training without external heuristics or auxiliary models.
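One way to see why reward variance can stand in for informativeness is the special case of a binary (correct/incorrect) reward. This is an illustrative assumption, not something the summary states about METIS's reward design. If a prompt is solved with probability p, the reward is a Bernoulli variable:

```latex
% Assumption: binary reward r \in \{0, 1\} with success probability p for a given prompt.
\operatorname{Var}[r] = \mathbb{E}[r^2] - \mathbb{E}[r]^2 = p - p^2 = p(1 - p),
\qquad
\max_{p \in [0,1]} p(1 - p) = \tfrac{1}{4} \ \text{at}\ p = \tfrac{1}{2},
\qquad
\operatorname{Var}[r] = 0 \ \text{at}\ p \in \{0, 1\}.
```

Under this assumption, prompts the policy always solves or never solves have zero variance, and prompts it solves about half the time have the highest variance.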
Findings
METIS achieves up to 67% faster convergence.
It outperforms existing methods across diverse benchmarks.
The policy learns to judge prompt informativeness itself, without external heuristics or auxiliary models.
Abstract
In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to…
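The signal the abstract describes, within-prompt reward variance, can be illustrated with a minimal sketch. This is not the METIS implementation: METIS predicts the variance with the policy itself, whereas the code below only computes the empirical variance from already-sampled rollouts and keeps the highest-variance prompts. The function names, the top-k selection rule, and the sample data are all hypothetical.

```python
from statistics import pvariance


def within_prompt_variance(rewards_by_prompt):
    """Map each prompt id to the population variance of its rollout rewards."""
    return {pid: pvariance(rs) for pid, rs in rewards_by_prompt.items()}


def select_prompts(rewards_by_prompt, k):
    """Return the k prompt ids with the highest within-prompt reward variance.

    Top-k selection is an illustrative allocation rule, not the one used by METIS.
    """
    variances = within_prompt_variance(rewards_by_prompt)
    return sorted(variances, key=lambda pid: variances[pid], reverse=True)[:k]


# Hypothetical rollouts: four sampled responses per prompt, binary rewards.
rollouts = {
    "easy": [1.0, 1.0, 1.0, 1.0],           # always solved, variance 0
    "mixed": [1.0, 0.0, 1.0, 0.0],          # solved half the time
    "hard": [0.0, 0.0, 0.0, 0.0],           # never solved, variance 0
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],  # solved three times out of four
}

selected = select_prompts(rollouts, k=2)
print(selected)
```

In this toy data the always-solved and never-solved prompts have zero variance and are skipped, while the prompts with mixed outcomes are kept for training.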
