Trust-Region Adaptive Policy Optimization

Mingyu Su; Jian Guan; Yuxian Gu; Minlie Huang; Hongning Wang

arXiv:2512.17636·cs.LG·December 22, 2025

Open Access · 3 Reviews

TL;DR

TRAPO is a hybrid training framework for large language models that combines supervised fine-tuning and reinforcement learning within each training instance, outperforming standard SFT, RL, and combined pipelines on reasoning benchmarks.

Contribution

The paper introduces TRAPO, a novel hybrid training method that interleaves SFT and RL, along with Trust-Region SFT and adaptive prefix selection, to enhance reasoning in LLMs.

Findings

01

TRAPO outperforms standard SFT, RL, and combined pipelines on reasoning benchmarks.

02

Trust-Region SFT stabilizes training and promotes mode-seeking updates.

03

Adaptive prefix selection improves guidance efficiency.

Abstract

Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization…
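
The per-instance split described in the abstract (SFT loss on the expert prefix, RL loss on the model's own completion) can be illustrated with a small sketch. This is not the authors' implementation: the helper names `split_expert_prefix` and `hybrid_token_losses`, the truncation rule for the prefix length, and the unweighted per-position selection are all assumptions made for illustration.

```python
from typing import List


def split_expert_prefix(expert_tokens: List[int], ratio: float) -> List[int]:
    """Keep the first `ratio` fraction of an expert trajectory as the prefix.

    ASSUMPTION: the prefix length is the token count times `ratio`,
    truncated toward zero. The source does not specify the rounding rule.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return expert_tokens[: int(len(expert_tokens) * ratio)]


def hybrid_token_losses(
    prefix_len: int,
    sft_token_loss: List[float],
    rl_token_loss: List[float],
) -> List[float]:
    """Pick the SFT loss on prefix positions and the RL loss elsewhere.

    Both input lists cover the full sequence (expert prefix followed by the
    model's completion). ASSUMPTION: the two terms are combined without any
    relative weighting.
    """
    if len(sft_token_loss) != len(rl_token_loss):
        raise ValueError("per-token loss lists must have the same length")
    return [
        sft_token_loss[t] if t < prefix_len else rl_token_loss[t]
        for t in range(len(sft_token_loss))
    ]
```

With a prefix ratio of 0 the sequence is trained purely with the RL loss, and with a ratio of 1.0 purely with the SFT loss, so the two standard regimes appear as the endpoints of the same routine.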

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Motivation is concrete. The paper clearly diagnoses why forward-KL SFT can hurt exploration when combined online with RL (distribution blending, degenerate rollouts) and proposes a minimal, targeted fix (clipping the per-token weight with alpha). Simple and practical. TrSFT is a one-line modification of the SFT loss; micro-group sampling is a lightweight, per-prompt procedure (ratios 0, 0.2, 0.5, 1.0; thresholds −1, 0.5, 0.7, 0.9; group sizes {4, 2, 1, 1}). Consistent improvements. On Qwen2.5-
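
The micro-group numbers quoted in this review (prefix ratios, thresholds, group sizes) suggest a sequential per-prompt schedule. The sketch below is one plausible reading, not the paper's algorithm: it assumes a micro-group receives its expert prefix only when the mean reward of the earlier rollouts for that prompt falls below the group's threshold, and otherwise samples without a prefix. The names `micro_group_schedule` and `rollout_reward` are hypothetical.

```python
from typing import Callable, List, Tuple

# Values quoted in the review; the rule that uses them below is an assumption.
RATIOS = [0.0, 0.2, 0.5, 1.0]
THRESHOLDS = [-1.0, 0.5, 0.7, 0.9]
GROUP_SIZES = [4, 2, 1, 1]


def micro_group_schedule(
    rollout_reward: Callable[[float], float],
) -> List[Tuple[float, float]]:
    """Return (prefix_ratio_used, reward) for every rollout of one prompt.

    `rollout_reward(ratio)` is a hypothetical stand-in for "generate one
    completion after an expert prefix of this ratio and score it".
    """
    results: List[Tuple[float, float]] = []
    for ratio, threshold, size in zip(RATIOS, THRESHOLDS, GROUP_SIZES):
        rewards = [reward for _, reward in results]
        mean_so_far = sum(rewards) / len(rewards) if rewards else 0.0
        # ASSUMPTION: give guidance only if earlier rollouts scored below
        # the threshold; a threshold of -1 therefore never triggers it.
        used = ratio if mean_so_far < threshold else 0.0
        for _ in range(size):
            results.append((used, rollout_reward(used)))
    return results
```

Under this reading, a prompt the model already solves gets eight unguided rollouts, while a prompt it always fails receives progressively longer expert prefixes. Because each micro-group depends on the rewards of the previous ones, the groups for a prompt must run in sequence.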

Weaknesses

Alpha inconsistency; please clarify. The main setup and alpha sweep indicate alpha = 0.1 works best. In Appendix C.2, however, TRAPO is described “except for setting the trust-region parameter alpha to 1,” which would nullify clipping and contradict the rest of the paper. Please reconcile and state the alpha used for every table/figure. Theory–practice gap and Proposition 1 details. The toy GMM example and the TrSFT optimum (Proposition 1) are helpful intuitively, but the proof sketch appeals t

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is well-organized and easy to follow. The method is clearly defined, and the figures are clear and helpful. 2. The ablation results support the design. Micro-group sampling alone improves over GRPO, while replacing TrSFT with standard SFT causes a clear drop. This suggests the trust region brings stability, and the adaptive prefixes provide useful guidance.

Weaknesses

1. This paper only validates TRAPO’s performance on established benchmarks, including 5 mathematical reasoning benchmarks and 2 general-domain reasoning benchmarks, while excluding newer, harder datasets like AIME25 and GPQA. 2. The micro-group sampling of TRAPO requires sequential processing for each prompt’s micro-groups within a mini-batch. This sequential logic for micro-groups is far less parallelizable than baselines like pure GRPO. Additionally, the training time is not provided, so the

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The per-instance coupling, SFT on prefixes + RL on suffixes, neatly addresses the two-stage inconsistency the authors highlight. This method directly targets the lack of exploration space for SFT'ed policies, which is a concrete and important problem in RL today. The micro-group schedule is clear and practical. 2. Clipping the token-level gradient weight with a threshold α seems to be an intuitive way to prevent outsized updates on low-probability expert tokens. Relevant hyperparameter study
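
The intuition behind clipping the token-level weight can be shown numerically. For the standard SFT loss −log p, the gradient with respect to the model parameters is −(1/p)∇p, so an expert token with low probability p carries a weight of 1/p. The sketch below assumes the clipped weight takes the form 1/max(p, α); the paper's exact parameterization may differ, and the function names are hypothetical.

```python
from typing import List


def sft_token_weights(probs: List[float]) -> List[float]:
    """Weight on grad(p) under the standard SFT loss -log p, i.e. 1/p."""
    return [1.0 / p for p in probs]


def clipped_token_weights(probs: List[float], alpha: float) -> List[float]:
    """Weight on grad(p) when 1/p is capped at 1/alpha.

    ASSUMPTION: the cap has the form 1/max(p, alpha). Tokens with
    p >= alpha keep their usual SFT weight; tokens with p < alpha have
    their weight capped at 1/alpha.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return [1.0 / max(p, alpha) for p in probs]


# Model probabilities assigned to three expert tokens (illustrative values).
probs = [0.5, 0.1, 0.01]
plain = sft_token_weights(probs)
capped = clipped_token_weights(probs, alpha=0.1)
```

In this assumed form, the token with p = 0.01 has its weight reduced from roughly 100 to roughly 10, while the token with p = 0.5 is unaffected, and letting α approach 0 recovers the unclipped SFT weights.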

Weaknesses

1. One baseline (or type of baseline) I consider missing from the paper is existing work that directly improves sampling diversity during rollout in RL. Simple baselines might include increasing temperature, or other algorithms that encourage diverse samples. I recognize the authors' argument that SFT may decrease sampling diversity, but there are also simple entropy control mechanisms like [1] (and possibly some others). 2. Training relies on OpenR1-Math expert trajectories (DeepSeek-R1), and the cor

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning