Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
Yurun Yuan, Tengyang Xie

TL;DR
This paper demonstrates that reintroducing explicit Markov states into reinforcement learning for large language models can overcome the existing capability ceiling, enabling more novel reasoning and discovery.
Contribution
The paper reintroduces explicit Markov states into LLM post-training. It provides theoretical guarantees that estimated Markov states reduce sample complexity, along with empirical evidence that they break the capability ceiling of standard RL post-training.
Findings
Markov states reduce sample complexity in LLM RL
Explicit Markov states break performance ceilings
Structured states enhance reasoning capabilities
Abstract
Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the…
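The contrast the abstract draws between an ever-expanding action history and a compact Markov state can be made concrete with a small sketch. The toy task below is a hypothetical illustration, not the paper's setup: actions are drawn from {0, 1, 2}, and the only thing that matters for the future is the running sum modulo a hypothetical constant `M`. The names `history_state`, `markov_state`, `count_states`, and `is_markov` are invented for this example. The sketch counts how many distinct states each representation produces and checks that the compact state is Markov. The link to sample complexity is tabular-RL intuition (fewer distinct states means fewer situations to learn about), not the paper's theorem.

```python
from itertools import product

# Hypothetical toy task (illustration only, not the paper's setup):
# at each step the agent picks an action in {0, 1, 2}, and the outcome
# depends only on the running sum of actions modulo M.
ACTIONS = (0, 1, 2)
M = 5


def history_state(actions):
    """History-based state: the full tuple of past actions, which grows every step."""
    return tuple(actions)


def markov_state(actions):
    """Compact Markov state: (step index, running sum mod M).

    The next state depends only on this pair and the chosen action,
    because (t, s) -> (t + 1, (s + a) % M).
    """
    return (len(actions), sum(actions) % M)


def count_states(horizon, state_fn):
    """Count distinct states over all action sequences of length <= horizon."""
    states = set()
    for t in range(horizon + 1):
        for seq in product(ACTIONS, repeat=t):
            states.add(state_fn(seq))
    return len(states)


def is_markov(horizon, state_fn):
    """Check that histories mapped to the same state agree on the next state for every action."""
    seen = {}
    for t in range(horizon):
        for seq in product(ACTIONS, repeat=t):
            s = state_fn(seq)
            nxt = tuple(state_fn(seq + (a,)) for a in ACTIONS)
            # The first history seen for state s fixes its successors; any mismatch breaks Markovness.
            if seen.setdefault(s, nxt) != nxt:
                return False
    return True


if __name__ == "__main__":
    for T in (4, 6, 8):
        print(T, count_states(T, history_state), count_states(T, markov_state))
```

For a horizon of 6, the history representation yields 1,093 distinct states (every action sequence is its own state), while the Markov representation yields 29. The history count grows exponentially with the horizon, but the Markov count grows only linearly.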
