On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Wenhao Zhang; Yuexiang Xie; Yuchang Sun; Yanxi Chen; Guoyin Wang; Yaliang Li; Bolin Ding; Jingren Zhou

arXiv:2508.11408·cs.LG·March 18, 2026

3 Reviews

TL;DR

This paper introduces CHORD, a framework that dynamically balances supervised fine-tuning and reinforcement learning in large language models, improving stability and performance by harmonizing off-policy expert data with on-policy exploration.

Contribution

It presents a novel unified view of SFT and RL, incorporating dynamic weighting and dual-control mechanisms to enhance learning stability and effectiveness.

Findings

1. CHORD outperforms baseline methods in various tasks.
2. Dynamic weighting improves the balance between imitation and exploration.
3. The framework promotes stable and efficient learning.

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1. This paper clearly documents, with empirical evidence, the instability of SFT under off-policy expert trajectories.
2. The hybrid objective (μ-weighted SFT + RL) is easy to implement.
3. The token-wise weighting is a simple stability heuristic, and the ablations on μ and training dynamics are decent.

Weaknesses

1. Combining supervised learning and RL during fine-tuning has been explored in prior work (e.g., SRFT, SimpleMix, LUFFY). CHORD uses a similar structure, optimizing a weighted sum of the SFT loss and GRPO, with the addition of a global schedule μ and a token-level weight φ(y)=p(1−p).
2. The heuristic p(1−p) is plausible but lacks theoretical backing or strong comparisons to alternative uncertainty weights (entropy/focal/margin).
3. There is a heavy reliance on DeepSeek-R1 experts;
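
The structure this reviewer describes (a weighted sum of an SFT loss and an RL loss, plus the token weight p(1−p)) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the convex-combination form (1 − μ)·RL + μ·SFT follows the wording used in the reviews, and averaging over tokens is an assumption. The RL loss is passed in as a plain number because the GRPO computation is not described in this excerpt.

```python
import math
from typing import Sequence


def phi(p: float) -> float:
    """Token weight p(1 - p), the form quoted in the reviews.

    p is assumed to be the policy's probability of the expert token.
    """
    return p * (1.0 - p)


def weighted_sft_loss(expert_token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of expert tokens, each scaled by phi(p).

    Averaging over tokens is an assumption made for this sketch.
    The sequence must be non-empty and every probability in (0, 1].
    """
    terms = [-phi(p) * math.log(p) for p in expert_token_probs]
    return sum(terms) / len(terms)


def hybrid_loss(rl_loss: float, sft_loss: float, mu: float) -> float:
    """Convex combination (1 - mu) * RL + mu * SFT (assumed form)."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1.0 - mu) * rl_loss + mu * sft_loss


if __name__ == "__main__":
    probs = [0.05, 0.5, 0.95]  # illustrative expert-token probabilities
    for p in probs:
        print(f"p={p:.2f}  phi={phi(p):.4f}")
    sft = weighted_sft_loss(probs)
    print(hybrid_loss(rl_loss=1.0, sft_loss=sft, mu=0.3))
```

In this sketch the weight is largest for tokens the policy is unsure about (p near 0.5) and shrinks toward zero for tokens it already predicts confidently or considers very unlikely, which is the behavior the reviewers characterize as an uncertainty heuristic.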

Reviewer 02·Rating 4·Confidence 3

Strengths

- The paper targets an important and timely problem in large language model post-training: how to combine supervised expert data with reinforcement learning in an effective way.
- The proposed framework is simple, well-motivated, and easy to implement in existing RLHF pipelines. The dual-control design (μ and ϕ) provides both stage-level and token-level balance between on- and off-policy learning.
- Experiments are extensive and include ablations (fixed vs. dynamic μ, with vs. without ϕ), en

Weaknesses

- The novelty of CHORD is limited. The method reweights two existing loss terms (SFT and RL) using a dynamic coefficient and a heuristic token-wise weighting. Similar annealing strategies and uncertainty-based regularization have been explored in LUFFY, SRFT, and PPO variants with KL or imitation penalties.
- The token-level weighting ϕ(p)=p(1−p) is conceptually similar to entropy-based weighting and lacks theoretical justification for its specific form.
- The improvement margins over baselines

Reviewer 03·Rating 6·Confidence 3

Strengths

- The paper clearly describes the two objectives used in the proposed method, CHORD, as well as the experimental setup and results.
- The analysis conducted to motivate the method is clear.
- Empirical results suggest that the proposed method (CHORD-ϕ) improves over several reasonable baselines on math and tool-use tasks. The ablations included in the main paper suggest that the transition from offline imitation to online RL is effective as

Weaknesses

- The paper proposes one simple way to combine the two objectives. It is not clear why a convex combination of the SFT and RL objectives is the right approach. Would it be possible to have generic weights for SFT and RL and let the model and data decide their optimal values?
- The objective for CHORD-ϕ uses a weight that looks like the variance of a Bernoulli random variable. As with the previous point, is this the optimal value for this weight? Are there any insights on what might happen if the base model
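
The reviewer's remark about the Bernoulli variance can be checked directly. The short derivation below is standard probability, not material from the paper; reading p as the policy's probability of the expert token is an assumption carried over from the reviews' notation.

```latex
% X ~ Bernoulli(p): X = 1 with probability p, X = 0 otherwise.
\begin{align*}
\mathbb{E}[X] &= p, \qquad \mathbb{E}[X^2] = p, \\
\operatorname{Var}[X] &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1-p), \\
\frac{d}{dp}\, p(1-p) &= 1 - 2p = 0 \;\Longrightarrow\; p = \tfrac{1}{2},
\qquad \max_{p \in [0,1]} p(1-p) = \tfrac{1}{4}.
\end{align*}
```

The weight therefore peaks at p = 1/2 and vanishes at p = 0 and p = 1, so it matches the Bernoulli variance in form; whether that form is optimal is the open question the reviewer raises.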
