Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak

TL;DR
This paper introduces CoT2, a continuous token approach for language models that enhances reasoning and search capabilities, supported by theoretical guarantees, new algorithms, and empirical improvements over traditional discrete methods.
Contribution
It provides the first theoretical analysis and algorithms for continuous chain-of-thought reasoning, demonstrating improved inference efficiency and reasoning abilities in language models.
Findings
CoT2 enables parallel tracking of multiple reasoning traces.
Optimal parallelism depends on embedding dimension.
Supervised CoT2 outperforms discrete and continuous baselines.
Abstract
Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 enables the model to track multiple discrete traces in parallel, and we quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial "subset sum problem" given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language…
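To make the idea of a continuously-valued token concrete, here is a minimal NumPy sketch. It assumes, following the reviewers' description of each continuous token as a superposition of discrete tokens, that the token fed back to the model is a convex combination of vocabulary embeddings. The multi-token variant, which averages the embeddings of `k` sampled tokens, is one plausible reading of the MTS procedure mentioned in the reviews. The function names, shapes, and toy values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def continuous_token(logits, embeddings):
    """Softmax-weighted average of all vocabulary embeddings (assumed CoT2-style token)."""
    alpha = softmax(logits)
    return alpha @ embeddings

def multi_token_sample(logits, embeddings, k, rng):
    """Average the embeddings of k tokens sampled i.i.d. from the softmax (assumed MTS)."""
    alpha = softmax(logits)
    ids = rng.choice(len(alpha), size=k, replace=True, p=alpha)
    return embeddings[ids].mean(axis=0), ids

# Toy setup: 5-token vocabulary, 8-dimensional embeddings (hypothetical values).
rng = np.random.default_rng(0)
vocab_size, dim = 5, 8
embeddings = rng.normal(size=(vocab_size, dim))

# Two tokens are equally plausible, so the mixture keeps both candidates alive.
logits = np.array([2.0, 2.0, -1.0, -1.0, -1.0])
z_full = continuous_token(logits, embeddings)
z_mts, sampled_ids = multi_token_sample(logits, embeddings, k=4, rng=rng)
```

With `k=1` the sampled variant reduces to ordinary discrete sampling, and a sharply peaked softmax makes `continuous_token` collapse to a single embedding, so both can be viewed as generalizations of the discrete case under these assumptions.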
Peer Reviews
Decision: ICLR 2026 Poster
* The proposed CoT2 method is novel and simple to implement on top of existing models.
* Since each continuous token is a superposition of discrete tokens, the CoT should be more interpretable compared to methods which simply append the hidden representations.
* Theoretical analysis supports the intuitive benefits of tracking multiple traces. In particular, the increased information content is reflected in improved sample complexity.
* It is demonstrated that GRPO with MTS can adapt models to co
* While each continuous token should carry $K$ times more information than a discrete token with MTS, as demonstrated by the sample complexity results, we also need to sample $K$ times more tokens when decoding. What is the tradeoff in terms of actual runtime (during SFT, decoding, RL, etc.)?
* While a transformer construction solving MNNS with CoT2 is provided in Proposition 2, there is no lower bound for ordinary CoT, so it is unclear whether this constitutes an actual improvement over discrete
- The framework offers a theoretically principled formulation of parallel reasoning within neural sequence models, on a scale that is testable with single GPUs.
- The CSFT objective is mathematically clean, converting multi-trajectory supervision into a single convex target, and the accompanying proofs clarify the representational capacity of continuous token mixtures.
- The paper establishes a precise connection between continuous probabilistic reasoning and information-theoretic efficiency, s
- The study’s experimental scope and scale are limited. All models are toy-sized and trained from scratch with pruned vocabularies on synthetic tasks and symbolized formulations that depart significantly from natural languages, making it unclear whether the proposed principles extend to large, pretrained language models operating over natural text.
- For reasons explained above, I am not convinced that the comparisons against conventional CoT and COCONUT are fair. The uncontrolled and unfair
The work is well presented with small-scale experiments, toy setups, and theoretical results to convey the CoT2 idea (especially the continuous token construction and fine-tuning/RL) and to build intuition about its effectiveness. The authors also discuss potential RL-based extensions for sampling and policy optimization with continuous-valued tokens, which can also inspire future efforts.
The extension of this approach to practical scenarios is neither clear nor discussed. Since CSFT requires a ground-truth $\alpha^*$ for each step, it is unclear how one would compute these targets for real-world data. The paper also gives no evidence that scaling horizontally with more CoT2 tokens is beneficial, whereas standard CoT + majority@K (with increasing K) can indeed surpass CoT2.
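The reviewers' point about per-step targets can be illustrated on a toy case. The sketch below assumes that $\alpha^*$ at step $t$ is the empirical distribution of the step-$t$ token across a set of target traces, and that the loss is a cross-entropy between $\alpha^*$ and the model's softmax output. This is one plausible reading of the "single convex target" the reviews describe and is not confirmed by the text here. The helper names `empirical_targets` and `soft_cross_entropy` are hypothetical.

```python
import numpy as np

def empirical_targets(traces, vocab_size):
    """alpha[t, v] = fraction of traces whose step-t token is v (assumed form of alpha*)."""
    traces = np.asarray(traces)
    n_traces, n_steps = traces.shape
    alpha = np.zeros((n_steps, vocab_size))
    for t in range(n_steps):
        alpha[t] = np.bincount(traces[:, t], minlength=vocab_size) / n_traces
    return alpha

def soft_cross_entropy(logits, alpha):
    """Mean over steps of -sum_v alpha[t, v] * log softmax(logits[t])[v]."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(alpha * log_probs).sum(axis=1).mean())

# Three equal-length target traces over a 4-token vocabulary (toy data).
traces = [[0, 2, 1], [0, 3, 1], [1, 2, 1]]
alpha_star = empirical_targets(traces, vocab_size=4)

# Loss of a model that outputs uniform logits at every step.
uniform_logits = np.zeros((3, 4))
loss = soft_cross_entropy(uniform_logits, alpha_star)
```

On synthetic search tasks, all valid traces can be enumerated, so such targets are cheap to build. For natural text there is no obvious enumeration, which is the gap this review raises.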
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Softmax · Sparse Evolutionary Training
