How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Chu-Cheng Lin; Eugene Ie

arXiv:2604.25907 · cs.LG · May 8, 2026

TL;DR

This paper introduces a unified loss family based on the Tsallis $q$-logarithm to explain and improve the training dynamics of reasoning models, addressing cold start issues and noise robustness.

Contribution

It provides a theoretical framework unifying RLVR and density estimation, and proposes new estimators GARL and PAFT that mitigate cold start problems in reasoning model training.

Findings

1. GARL at high $q$ mitigates cold-start stalling effectively.

2. GARL at low $q$ outperforms existing methods in warm start scenarios.

3. PAFT at $q=0.75$ achieves high performance on HotPotQA.

Abstract

SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_{Q}$ using the Tsallis $q$-logarithm. $J_{Q}$ is a single-parameter family that interpolates between RLVR (at $q = 0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q = 1$, the density-estimation pole), under which the standard pipeline corresponds to a stepwise $q = 1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_{\theta}^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $\Omega\left(\frac{1}{p_{0}}\right)$ time to escape cold start but is…
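
The interpolation described in the abstract can be illustrated numerically. The following is a minimal sketch, assuming the standard Tsallis $q$-logarithm $\log_q(x) = (x^{1-q} - 1)/(1-q)$ (with $\log_1 = \ln$) and assuming that $J_{Q}$ applies it to a per-example success probability $P_{\theta}$; the truncated abstract does not spell out the exact objective, and the names `q_log` and `amplification` are hypothetical, chosen for illustration only. Under these assumptions, the derivative of $\log_q(p)$ with respect to $p$ is $p^{-q}$, which matches the per-instance amplification named in the abstract.

```python
import math


def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x**(1 - q) - 1) / (1 - q), natural log at q = 1."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of q_log(p, q) with respect to p, i.e. p**(-q).

    Assumption: this plays the role of the per-instance factor P_theta^{-q}
    from the abstract, with p standing in for the success probability.
    """
    if p <= 0.0:
        raise ValueError("p must be positive")
    return p ** (-q)


if __name__ == "__main__":
    p0 = 1e-3  # illustrative cold-start success probability (not from the paper)
    for q in (0.0, 0.5, 0.75, 1.0):
        print(f"q={q:<4}  log_q(p0)={q_log(p0, q):.4f}  "
              f"amplification={amplification(p0, q):.2f}")
```

At $q = 0$ the factor is 1 for every $p$, so rare successes receive no extra weight; at $q = 1$ it equals $1/p$, so an example with success probability $10^{-3}$ is upweighted by roughly a thousand. Intermediate values such as $q = 0.75$ fall between the two.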
