CTRLS: Chain-of-Thought Reasoning via Latent State-Transition
Junda Wu, Yuxin Xiong, Xintong Li, Sheldon Yu, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, Julian McAuley

TL;DR
CTRLS introduces a novel framework that models chain-of-thought reasoning as a Markov decision process with latent states, enabling more systematic exploration and improved reasoning performance in large language models.
Contribution
The paper proposes CTRLS, which explicitly models chain-of-thought reasoning as latent state transitions in a Markov decision process and explores them with distributional reinforcement learning, improving exploration and reasoning quality without additional fine-tuning.
Findings
Improves reasoning accuracy on benchmark tasks.
Enhances diversity and exploration efficiency.
Provides theoretical grounding through an evidence lower bound (ELBO) analysis.
Abstract
Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modelling reasoning actions as explicit probability distributions in latent space, our approach explicitly models epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our…
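To make the abstract's framing concrete, here is a minimal toy sketch of a reasoning trajectory as a walk over discrete latent states, where each action is an explicit probability distribution over the next state. It is not the authors' implementation: the three-state logits table, the function names (`softmax`, `entropy`, `rollout`), and the use of Shannon entropy as a rough uncertainty signal are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more uncertain distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rollout(logits_table, start_state, num_steps, rng):
    """Sample a latent-state trajectory.

    At each step the action is the full distribution over next states,
    and the next state is drawn from it (Markov: it depends only on the
    current state).
    """
    state = start_state
    trajectory = [state]
    entropies = []
    for _ in range(num_steps):
        probs = softmax(logits_table[state])
        entropies.append(entropy(probs))
        state = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        trajectory.append(state)
    return trajectory, entropies


# Hypothetical transition scores for a 3-state toy problem (assumed values).
LOGITS = [
    [0.0, 2.0, 0.0],  # state 0 mostly moves to state 1
    [0.0, 0.0, 2.0],  # state 1 mostly moves to state 2
    [1.0, 1.0, 1.0],  # state 2 is maximally uncertain (uniform)
]

rng = random.Random(0)  # fixed seed so repeated runs sample the same path
trajectory, entropies = rollout(LOGITS, start_state=0, num_steps=4, rng=rng)
print("latent states visited:", trajectory)
print("per-step entropy:", [round(h, 3) for h in entropies])
```

Because the transition is a distribution rather than a single chosen step, sampling it repeatedly yields different trajectories from the same starting state, which is the kind of state-aware exploration the abstract describes. In this toy, the per-step entropy is highest wherever the distribution is closest to uniform.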
