Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang; Runze Liu; Fuzheng Zhang; Xiu Li; Guorui Zhou; Ling Pan

arXiv:2507.15778·cs.CL·May 18, 2026

TL;DR

This paper introduces Archer, a dual-token constraint framework for RLVR that enhances reasoning in large language models by differentiating token types during training.

Contribution

Archer is the first entropy-aware RLVR method that preserves sequence dependencies while applying distinct constraints to reasoning and knowledge tokens.

Findings

01

Archer outperforms strong baselines on mathematical reasoning and code generation tasks.

02

It improves pass@1 and pass@K performance across multiple model scales.

03

Differentiated token constraints lead to better reasoning and knowledge retention.
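
For readers unfamiliar with the pass@K metric named in the findings, the sketch below shows the commonly used unbiased estimator: given n sampled solutions per problem of which c are correct, it estimates the probability that at least one of k samples passes. This page does not state which estimator or sample counts the paper uses, so the function name `pass_at_k` and the example numbers are illustrative assumptions.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that are correct,
    k: evaluation budget. Returns 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical example: 10 samples, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n; a benchmark score is the mean of these per-problem values.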

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method…
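
The idea described in the abstract can be sketched roughly as follows: compute per-token entropy, split the tokens of each response at an entropy quantile, and keep every token in the same update while giving each token type its own clip range and KL weight. This is a minimal illustration, not the authors' implementation. The names `rho`, `eps_k`, `eps_r`, `beta_k`, and `beta_r` mirror the symbols mentioned in the reviews; the nearest-rank quantile rule, the PPO-style clipped surrogate, the KL estimator, and all default values are assumptions made for the example.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_entropy(entropies, rho=0.8):
    """Label the tokens of one non-empty response.

    True  -> high-entropy ("reasoning") token
    False -> low-entropy ("knowledge") token
    The threshold is the rho-quantile of this response's entropies
    (nearest-rank rule; the quantile rule itself is an assumption).
    """
    ranked = sorted(entropies)
    idx = min(len(ranked) - 1, int(rho * len(ranked)))
    threshold = ranked[idx]
    return [h >= threshold for h in entropies]


def dual_constraint_loss(logp_new, logp_old, logp_ref, advantages,
                         is_reasoning, eps_k=0.2, eps_r=0.5,
                         beta_k=1e-3, beta_r=0.0):
    """Mean per-token loss with type-specific clip range and KL weight.

    All default values are illustrative placeholders, not the
    paper's reported settings.
    """
    total = 0.0
    for ln, lo, lr, adv, reasoning in zip(
            logp_new, logp_old, logp_ref, advantages, is_reasoning):
        # Looser constraints for reasoning tokens, tighter for knowledge.
        eps = eps_r if reasoning else eps_k
        beta = beta_r if reasoning else beta_k
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative estimate of KL to a reference policy (assumed form).
        d = lr - ln
        kl = math.exp(d) - d - 1.0
        total += -(surrogate - beta * kl)
    return total / len(logp_new)
```

Because no token is masked out, every position still contributes to the same loss; only the strength of the constraint differs by token type, which is the contrast with masking or asynchronous training that the abstract draws.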

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

1. The method is simple. Step-wise entropy is easy to compute and does highlight where models tend to struggle.
2. The paper gives ablations showing that removing KL on low-entropy tokens causes collapse.

Weaknesses

1. The core assumption is that entropy reliably separates reasoning from knowledge, which is questionable. Entropy shifts with sampling, style, prompt form, and generation length.
2. The paper uses some heuristics that need more justification to support the claimed effectiveness, such as the fixed empirical quantile threshold for entropy.
3. The evaluation is narrow (math, code) and does not test whether the method generalizes to less structured reasoning. I find them unconvincing since we all know t

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The proposed method adds a clear, implementation-friendly mechanism (token-typed constraints from response-level entropy) that integrates seamlessly with existing GRPO/DAPO setups. The decision to keep synchronous updates while relaxing constraints on reasoning tokens is well motivated, and the visual analysis of optimization regions and token interleaving makes the intuition concrete.
2. The paper demonstrates consistent improvements across math and code benchmarks, suggesting the approach b

Weaknesses

1. All results rely on a single 1.5B base (DeepSeek-R1-Distill-Qwen-1.5B); adding a second backbone or a larger-scale model would strengthen generality. Further, the entropy quantile (ρ) is fixed without a sensitivity study, and the work would benefit from either a small sweep or an adaptive rule of thumb.
2. No direct head-to-head with token masking/asynchronous baselines. The paper critiques these strategies but does not supply a controlled replication (same data/compute) of a recent masking

Reviewer 03·Rating 6·Confidence 3

Strengths

- Significant Performance Gains. The method achieves outstanding results on several challenging math and code benchmarks. Compared to the baseline DAPO algorithm, Archer demonstrates significant gains, such as +6.6 Pass@1 on AIME24, and achieves SOTA performance among similarly sized models.
- Higher Training Efficiency. The paper reports that unlike other SOTA models that rely on complex multi-stage or multi-round training, Archer achieves its best average accuracy with only single-stage traini

Weaknesses

- Introduction of Hyperparameters. The method introduces several hyperparameters that require careful tuning, including the clipping ranges ($\epsilon^{k}$, $\epsilon^{r}$) and KL weights ($\beta^{k}$, $\beta^{r}$) for both token types. The ablation studies show that model performance is quite sensitive to these values (especially $\beta^{k}$), which may increase the difficulty of reproducing the best results on different tasks or models.
- Limited Generalizability of Model and Task. All experim

Code & Models

Repositories

wizard-iii/ArcherCodeR
github

Models

🤗
Fate-Zero/Archer-Code-1.5B
model· 6 dl

Datasets

Fate-Zero/Archer-Code-1.5B
dataset· 144 dl
