Post-Training as Reweighting: A Stochastic View of Reasoning Trajectories in Language Models

Dake Bu; Wei Huang; Andi Han; Atsushi Nitanda; Bo Xue; Qingfu Zhang; Hau-San Wong; Taiji Suzuki

arXiv:2511.07368 · cs.LG · January 21, 2026

Open Access

TL;DR

This paper models reasoning in language models as stochastic trajectories and shows, through theoretical and empirical analysis, how post-training methods reweight reasoning paths and how that reweighting affects the model's ability to handle complex tasks.

Contribution

The paper introduces a stochastic-trajectory framework for analyzing post-training reasoning and uses it to show how reweighting affects reasoning diversity and the handling of difficult tasks.

Findings

01. Post-training reweights reasoning trajectories, favoring high-probability paths.

02. Rare but crucial reasoning paths are suppressed by common post-training methods.

03. Exploration techniques help preserve low-probability, essential reasoning trajectories.

Abstract

Foundation models encode rich structural knowledge but often rely on post-training procedures to adapt their reasoning behavior to specific tasks. Popular approaches such as reinforcement learning with verifiable rewards (RLVR) and inference-time reward aggregation are typically analyzed from a performance perspective, leaving their effects on the underlying reasoning distribution less understood. In this work, we study post-training reasoning from a stochastic trajectory viewpoint. Following Kim et al. (2025), we model reasoning steps of varying difficulty as Markov transitions with different probabilities, and formalize reasoning processes using tree-structured Markov chains. Within this framework, pretraining corresponds to discovering the reasoning structure, while post-training primarily reweights existing chains of thought. We show that both RLVR and inference-time reward…
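The trajectory view described in the abstract can be made concrete with a toy sketch. The tree, the transition probabilities, the reward values, and the exponential-tilting rule below are all illustrative assumptions, not the paper's construction; the names `TRAJECTORIES`, `REWARD`, `base_distribution`, and `reweight` are hypothetical. The sketch only shows two ideas from the text: a trajectory's probability is the product of its Markov transition probabilities, and reweighting changes the mass on existing trajectories without creating new ones.

```python
from math import exp, prod

# Hypothetical toy tree (an assumption, not taken from the paper).
# Root branches to an "easy" step (p=0.9) or a "hard" step (p=0.1);
# the easy branch then ends wrong (p=0.8) or right (p=0.2),
# while the hard branch always ends right (p=1.0).
TRAJECTORIES = {
    "easy_wrong": [0.9, 0.8],
    "easy_right": [0.9, 0.2],
    "hard_right": [0.1, 1.0],
}

# Hypothetical verifiable reward: 1 for a correct final answer, 0 otherwise.
REWARD = {"easy_wrong": 0.0, "easy_right": 1.0, "hard_right": 1.0}


def base_distribution(trajectories):
    """Probability of each trajectory: product of its step probabilities."""
    return {name: prod(steps) for name, steps in trajectories.items()}


def reweight(base, reward, beta):
    """Tilt the base distribution by exp(beta * reward) and renormalize.

    This exponential tilting is an assumed stand-in for post-training;
    it only redistributes mass over trajectories already in `base`.
    """
    weights = {name: p * exp(beta * reward[name]) for name, p in base.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    base = base_distribution(TRAJECTORIES)
    tilted = reweight(base, REWARD, beta=2.0)
    for name in base:
        print(f"{name}: base={base[name]:.3f} reweighted={tilted[name]:.3f}")
```

With `beta=0` the reweighted distribution equals the base one, and larger `beta` moves mass away from the unrewarded path. This particular tilting rule keeps the ratio between the two rewarded paths fixed, so it illustrates reweighting on a fixed support only; it does not reproduce the paper's analysis of RLVR or inference-time reward aggregation.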

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics · Ethics and Social Impacts of AI