How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Chu-Cheng Lin, Eugene Ie

TL;DR
This paper introduces a unified loss family based on the Tsallis $q$-logarithm to explain and improve the training dynamics of reasoning models, with a focus on cold-start stalling and robustness to noise.
Contribution
The paper provides a theoretical framework that unifies RLVR and density estimation, and proposes two new estimators, GARL and PAFT, that mitigate cold-start problems in reasoning model training.
Findings
GARL at high $q$ mitigates cold-start stalling effectively.
GARL at low $q$ outperforms existing methods in warm-start scenarios.
PAFT at $q=0.75$ achieves high performance on HotPotQA.
Abstract
SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family using the Tsallis $q$-logarithm. The family is parameterized by a single $q$ and interpolates between RLVR (at $q=0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q=1$, the density-estimation pole), under which the standard pipeline corresponds to a stepwise schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires time to escape cold start but is…
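The interpolation and the shared gradient direction described in the abstract can be illustrated with the standard Tsallis $q$-logarithm, $\log_q(x) = (x^{1-q} - 1)/(1-q)$, which reduces to $x - 1$ at $q=0$ and to $\ln x$ as $q \to 1$. The sketch below assumes the objective for one example is $\log_q(p)$, where $p$ is the model's success probability on that example; this form, and the names `tsallis_log`, `amplification`, and `p`, are illustrative assumptions rather than the paper's exact definitions. Under that assumption the chain rule gives $\nabla \log_q(p) = p^{-q}\,\nabla p$, so every $q$ follows the direction $\nabla p$ and only the scalar weight $p^{-q}$ changes.

```python
import math


def tsallis_log(x: float, q: float) -> float:
    """Standard Tsallis q-logarithm: (x**(1-q) - 1) / (1-q); natural log at q = 1."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if math.isclose(q, 1.0):
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of log_q(p) with respect to p, which equals p**(-q).

    Assumption: the per-example objective is log_q(p). Then the gradient
    with respect to model parameters is amplification(p, q) * grad(p),
    i.e. the same direction for every q, scaled per instance.
    """
    if p <= 0.0:
        raise ValueError("p must be positive")
    return p ** (-q)


if __name__ == "__main__":
    # Compare the per-instance weight across q for a hard (cold-start)
    # example and an easy one. The probabilities are made-up inputs.
    for p in (0.001, 0.5):
        for q in (0.0, 0.5, 1.0):
            print(f"p={p:<6} q={q:<4} weight={amplification(p, q):.3f}")
```

At $q=0$ the weight is 1 for every example, which matches a plain expected-reward gradient; at $q=1$ the weight is $1/p$, so examples the model rarely solves are amplified most. This is one way to read the claim that the two poles differ in how quickly they leave cold start.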
