Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou, Ling Pan

TL;DR
This paper introduces Archer, a dual-token constraint framework for RLVR that enhances reasoning in large language models by differentiating token types during training.
Contribution
Archer is the first entropy-aware RLVR method that preserves sequence dependencies while applying distinct constraints to reasoning and knowledge tokens.
Findings
Archer outperforms strong baselines on mathematical reasoning and code generation tasks.
It improves pass@1 and pass@K performance across multiple model scales.
Differentiated token constraints lead to better reasoning and knowledge retention.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method…
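The entropy-based split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`token_entropy`, `quantile`, `label_tokens`) are hypothetical, and the use of a per-response quantile `rho` with a default of 0.8 is an assumption made only for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def label_tokens(entropies, rho=0.8):
    """Label each token of ONE response by comparing its entropy to the
    response's own rho-quantile. rho=0.8 is a hypothetical value."""
    threshold = quantile(entropies, rho)
    return ["reasoning" if h > threshold else "knowledge" for h in entropies]


# Example: per-token entropies of a single five-token response.
labels = label_tokens([0.1, 0.2, 0.05, 1.5, 0.3], rho=0.8)
print(labels)
```

Because the threshold is computed per response, every token stays in the sequence and only its label changes, which is the property the abstract contrasts with masking-based isolation.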
Peer Reviews
Decision: Submitted to ICLR 2026
1. The method is simple. Step-wise entropy is easy to compute and does highlight where models tend to struggle.
2. The paper gives ablations showing that removing KL on low-entropy tokens causes collapse.
1. The core assumption is that entropy reliably separates reasoning from knowledge, which is questionable. Entropy shifts with sampling, style, prompt form, and generation length.
2. The paper uses some heuristics that need more justification before their effectiveness can be claimed, such as the fixed empirical quantile threshold for entropy.
3. The evaluation is narrow (math, code) and does not test whether the method generalizes to less structured reasoning. I do not find them convincing, since we all know t
1. This proposed method adds a clear, implementation‑friendly mechanism, token‑typed constraints from response‑level entropy, that integrates seamlessly with existing GRPO/DAPO setups. The decision to keep synchronous updates while relaxing constraints on reasoning tokens is well motivated, and the visual analysis of optimization regions and token interleaving makes the intuition concrete.
2. The paper demonstrates consistent improvements across math and code benchmarks, suggesting the approach b
1. All results rely on a single 1.5B base (DeepSeek‑R1‑Distill‑Qwen‑1.5B); adding a second backbone or a larger‑scale model would strengthen generality. Further, the entropy quantile (ρ) is fixed without a sensitivity study, and the work would benefit from either a small sweep or an adaptive rule of thumb.
2. No direct head‑to‑head with token masking/asynchronous baselines. The paper critiques these strategies but does not supply a controlled replication (same data/compute) of a recent masking
- Significant Performance Gains. The method achieves outstanding results on several challenging math and code benchmarks. Compared to the baseline DAPO algorithm, Archer demonstrates significant gains, such as +6.6 Pass@1 on AIME24, and achieves SOTA performance among similarly sized models.
- Higher Training Efficiency. The paper reports that unlike other SOTA models that rely on complex multi-stage or multi-round training, Archer achieves its best average accuracy with only single-stage traini
- Introduction of Hyperparameters. The method introduces several hyperparameters that require careful tuning, including the clipping ranges ($\epsilon^{k}$, $\epsilon^{r}$) and KL weights ($\beta^{k}$, $\beta^{r}$) for both token types. The ablation studies show that model performance is quite sensitive to these values (especially $\beta^{k}$), which may increase the difficulty of reproducing the best results on different tasks or models.
- Limited Generalizability of Model and Task. All experim
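To make the roles of the four hyperparameters named in this review concrete, here is a minimal sketch of a PPO/GRPO-style clipped surrogate in which the clip range and KL weight are chosen per token type. It is an illustration under stated assumptions, not the paper's exact objective: the function name `dual_token_loss`, the per-token mean reduction, and all default values are hypothetical. The only thing taken from the reviews is the direction of the asymmetry (looser constraints on reasoning tokens, tighter ones on knowledge tokens).

```python
def dual_token_loss(ratios, advantages, kls, is_reasoning,
                    eps_r=0.5, eps_k=0.2, beta_r=0.0, beta_k=0.001):
    """Mean per-token loss with type-dependent clipping and KL weight.

    ratios       : importance ratios pi_new / pi_old, one per token
    advantages   : advantage estimate for each token
    kls          : per-token KL estimate against a reference policy
    is_reasoning : True for high-entropy (reasoning) tokens
    eps_*, beta_*: hypothetical defaults, not values from the paper
    """
    total = 0.0
    for ratio, adv, kl, reasoning in zip(ratios, advantages, kls, is_reasoning):
        eps = eps_r if reasoning else eps_k      # clip range for this token type
        beta = beta_r if reasoning else beta_k   # KL weight for this token type
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        surrogate = min(ratio * adv, clipped * adv)  # pessimistic clipped objective
        total += -surrogate + beta * kl
    return total / len(ratios)


# Two tokens with the same ratio and advantage, differing only in type.
loss = dual_token_loss(
    ratios=[1.4, 1.4], advantages=[1.0, 1.0], kls=[0.1, 0.1],
    is_reasoning=[True, False],
    eps_r=0.5, eps_k=0.2, beta_r=0.0, beta_k=1.0,
)
print(loss)
```

In this example the reasoning token keeps its full ratio of 1.4, while the knowledge token is clipped to 1.2 and additionally pays a KL penalty. Both tokens still contribute to the same loss, so the update remains joint rather than masked.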
