Markovian Transformers for Informative Language Modeling
Scott Viteri, Max Lamparth, Peter Chatain, Clark Barrett

TL;DR
This paper introduces a Markovian language model framework with a reasoning bottleneck that ensures the model's answers are derived solely from natural-language reasoning steps, improving interpretability and causal reliance on reasoning chains.
Contribution
It proposes a novel Markovian framework with an autoencoder-style bottleneck for reasoning, trained with a specialized policy gradient algorithm, enhancing interpretability and transferability of reasoning steps.
Findings
Significant improvement in QA task accuracy with Markovian training.
Models show stronger causal reliance on reasoning chains under perturbation.
Cross-architecture generalization of learned reasoning steps.
Abstract
Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model's underlying decision process. We address this by introducing a Markovian language model framework with an autoencoder-style reasoning bottleneck: all information flowing from question to answer must pass through a bounded-length CoT, creating a bandwidth bottleneck analogous to the latent layer of an autoencoder. In practice, the KL penalty toward the pretrained distribution and the inductive biases of gradient descent discourage steganographic encoding, so the model learns to express its reasoning in natural-language steps from which the answer can be derived. We train this system with a GRPO-style policy gradient algorithm using parallel sampling, a frozen baseline CoT, within-batch standardized advantages, and actor-reward (chain-rule) gradients. On QA tasks, Markovian training recovers most of the…
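One ingredient named in the abstract, within-batch standardized advantages measured against a frozen baseline CoT, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `standardized_advantages`, the use of answer log-probabilities as the reward signal, and the exact way the baseline enters are assumptions made for illustration only.

```python
import math

def standardized_advantages(answer_logprobs, baseline_logprob, eps=1e-8):
    """Turn per-sample rewards into within-batch standardized advantages.

    answer_logprobs: hypothetical log p(answer | sampled CoT), one value per
        CoT sampled in parallel for the same question.
    baseline_logprob: hypothetical log p(answer | baseline CoT), where the
        baseline CoT comes from a frozen copy of the model.
    """
    # Assumed reward: how much each sampled CoT improves on the frozen baseline.
    rewards = [lp - baseline_logprob for lp in answer_logprobs]

    # Standardize within the batch: zero mean, roughly unit variance.
    # Note that a constant baseline shift cancels in the mean-centering step;
    # it is kept here only to mirror the description in the abstract.
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Four CoTs sampled in parallel for one question (made-up numbers).
advantages = standardized_advantages([-2.0, -1.0, -3.0, -2.0], baseline_logprob=-2.5)
```

In a policy-gradient update, each advantage would weight the log-probability of its sampled CoT, so CoTs that make the answer more predictable than the batch average are reinforced. The KL penalty and the actor-reward gradient term mentioned in the abstract are omitted from this sketch.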
Peer Reviews
Decision·ICLR 2026 Poster
- The idea of enforcing a Markovian structure to make CoTs causally essential is novel, conceptually elegant, and well-motivated.
- The introduction of informativeness as a learning objective is interesting and moves beyond traditional notions of faithfulness or interpretability.
- The formalisation of the Markovian LM and integration of actor–reward gradients (where the reward depends on the same model parameters) are technically sound and well-presented.
- Empirical results show strong and
- It is unclear which components of the method (Markovian bottleneck, actor–reward coupling, within-batch normalisation, or reward design) are responsible for the observed gains. The paper should include controlled ablations to isolate these effects.
- The informativeness criterion works well for deductive or mathematical reasoning tasks where the CoT captures logical steps. However, for non-deductive or knowledge-grounded tasks (e.g., MMLU, factual QA), informativeness alone may be insufficien
It is a nice and elegant idea to introduce causal reliance by construction. The distinction between "faithfulness" and "informativeness" is pragmatic and operationalizable. The reinforcement learning formulation seems sound, and there is good improvement after training. There is interesting cross-model generalization, where CoTs learned in one architecture transfer to other architectures.
The paper oscillates between two different stories: compression (the Wikipedia experiment) and sufficiency (the QA experiment). The experiments are a bit superficial: there is only one LLaMA model and no baselines such as SFT and GRPO (yet Appendix F shows some Wikipedia results for other models, so why not report results on the QA dataset as well?). It would be necessary to compare to other post-training baselines, especially on these datasets where any type of reinforcement learning
- The problem (chain-of-thought unfaithfulness) is of broad interest to the field.
- The method design seems novel and creative. It also seemed tricky to implement and the authors developed training strategies to make it work.
- Although the authors state that the approach improves performance on benchmarks, I don't think they compared to a baseline: for example, what if you just did the training but didn't have the attention to the original question blocked?
- It would also be nice to have some qualitative discussion of the chains of thought that result from this training procedure, whether through example transcripts or through some grading of readability.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: LLaMA
