TL;DR
This paper develops a theoretical framework called thought MDPs to understand when model-free reinforcement learning leads to thinking-like behaviors, supported by experiments with language models and toy domains.
Contribution
It introduces the thought MDP model, proves the importance of policy initialization, and demonstrates conditions under which thinking emerges in language models and toy environments.
Findings
Thought actions are linked to policy improvement steps.
Open-source LLMs satisfy the conditions the theory identifies for thinking-like behavior to emerge.
Thought actions improve data efficiency in toy RL tasks.
Abstract
Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to such "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions…
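To make the abstract's description concrete, here is a minimal sketch of a thought MDP as a wrapper around an ordinary environment. It is an illustration under stated assumptions, not the paper's formal definition: the names `ThoughtMDP`, `env_step`, `thought_step`, `chain_step`, and `count_thoughts` are hypothetical, transitions are deterministic for simplicity, and letting environment actions leave the thought state unchanged is a choice made for this sketch. The one property taken directly from the abstract is that a thought action yields no reward and does not change the external state.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Tuple

State = Hashable    # external (environment) state
Thought = Hashable  # internal thought state
Action = Hashable   # either an environment action or a thought action


@dataclass(frozen=True)
class ThoughtMDP:
    """Toy deterministic thought MDP (hypothetical names, illustrative only)."""

    # Ordinary MDP dynamics: (state, action) -> (next_state, reward).
    env_step: Callable[[State, Action], Tuple[State, float]]
    # Thought dynamics: (state, thought, thought_action) -> next_thought.
    thought_step: Callable[[State, Thought, Action], Thought]
    # The subset of actions treated as thought actions.
    thought_actions: frozenset

    def step(self, state: State, thought: Thought, action: Action):
        """Return (next_state, next_thought, reward) for one step."""
        if action in self.thought_actions:
            # Thought action: external state is untouched and reward is zero.
            return state, self.thought_step(state, thought, action), 0.0
        # Environment action: assumption of this sketch is that the
        # thought state carries over unchanged.
        next_state, reward = self.env_step(state, action)
        return next_state, thought, reward


def chain_step(state: int, action: str) -> Tuple[int, float]:
    """Hypothetical chain environment: 'right' moves +1, reward on reaching 3."""
    next_state = state + 1 if action == "right" else state
    return next_state, 1.0 if next_state == 3 else 0.0


def count_thoughts(state: int, thought: int, action: str) -> int:
    """Hypothetical thought dynamics: count the thought actions taken so far."""
    return thought + 1


mdp = ThoughtMDP(
    env_step=chain_step,
    thought_step=count_thoughts,
    thought_actions=frozenset({"think"}),
)

after_thinking = mdp.step(0, 0, "think")  # only the thought state changes
after_moving = mdp.step(2, 1, "right")    # only the external state changes
```

In this toy setup, a policy conditions on both `state` and `thought`, so a thought action can change what the agent does next even though it earns nothing by itself. This is the sense in which the abstract calls the emergence of thinking under reward maximization interesting.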