QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning
Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi

TL;DR
QUATRO introduces a principled, trust-region-based approach for LLM fine-tuning that improves stability and control over policy updates, outperforming heuristic methods on diverse mathematical reasoning benchmarks.
Contribution
It proposes a novel optimization method that directly enforces trust-region constraints, enhancing stability and interpretability in RL-based LLM fine-tuning.
Findings
Stable training under high policy staleness
Maintains controlled entropy during training
Outperforms heuristic trust-region methods on reasoning benchmarks
Abstract
GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Because they rely on heuristic trust-region approximations, however, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning…
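The abstract's criticism of heuristic clipping can be illustrated with a minimal sketch of the GRPO-style baseline it refers to. This is not QUATRO's objective, which the text above does not spell out. It shows a standard PPO-style clipped surrogate with group-normalized advantages. The function names, the clip width `clip_eps=0.2`, and the toy rewards and ratios are all illustrative assumptions. The point is that once a sample's importance ratio leaves the clipping range in the direction favored by its advantage, the surrogate contributes zero gradient for that sample, so the objective no longer regulates it.

```python
import math

def group_normalized_advantages(rewards, eps=1e-8):
    """Group-wise normalization: (r - mean) / (std + eps) over one query's samples."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def surrogate_grad_wrt_ratio(ratio, advantage, clip_eps=0.2):
    """Derivative of the clipped term with respect to the importance ratio.

    The derivative is zero whenever the clipped branch is the active minimum,
    i.e. the ratio is out of range on the side the advantage pushes toward.
    """
    if advantage > 0 and ratio > 1.0 + clip_eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - clip_eps:
        return 0.0
    return advantage

# Toy group of four responses to one query (hypothetical numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
ratios = [1.5, 0.5, 1.0, 1.0]  # first two lie outside [0.8, 1.2]

advantages = group_normalized_advantages(rewards)
grads = [surrogate_grad_wrt_ratio(r, a) for r, a in zip(ratios, advantages)]

for r, a, g in zip(ratios, advantages, grads):
    print(f"ratio={r:.2f}  advantage={a:+.2f}  d(surrogate)/d(ratio)={g:+.2f}")
```

In this toy group the first two samples have drifted furthest from the old policy, yet they receive no gradient at all. That is the failure mode the abstract attributes to heuristic clipping.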
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
