SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization
Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai

TL;DR
SPINE introduces a token-selective reinforcement learning method that enhances test-time reasoning in large language models by focusing updates on critical decision points and using entropy regularization, leading to more stable and effective adaptation.
Contribution
It proposes a novel token-selective test-time reinforcement learning framework that improves reasoning stability and performance without requiring labels or reward models.
Findings
Consistently improves Pass@1 across eight benchmarks.
Avoids response-length collapse during training.
Enhances stability of test-time reasoning in LLMs and MLLMs.
Abstract
Large language models (LLMs) and multimodal LLMs (MLL-Ms) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose \method, a token-selective test-time reinforcement learning framework that (i) performs distribution-aware forking-token selection to update only decision-critical branch points, and (ii) applies a robust entropy-band regularizer at those tokens to prevent premature collapse and suppress noisy drift.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
