Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought
Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

TL;DR
This paper provides a theoretical analysis of how Chain-of-Thought enhances in-context reinforcement learning, showing convergence properties and linking CoT to temporal difference learning in a linear Transformer setup.
Contribution
It offers the first theoretical framework explaining the interaction between CoT and ICRL, including convergence analysis and parameter optimality in Transformers.
Findings
CoT generation is equivalent to temporal difference updates under certain Transformer parameters.
Policy evaluation error decreases geometrically with CoT length.
Transformer parameters that enable CoT are global minimizers of pretraining loss.
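For reference, the temporal difference update that the first finding refers to is, in its textbook TD(0) form with a linear value estimate, the rule below. The symbols ($w_k$ for the weight vector, $\phi$ for the feature map, $\alpha$ for the step size, $\gamma$ for the discount factor) follow common RL convention and are assumptions here; the paper's own notation and its exact update may differ.

```latex
% Standard TD(0) with linear function approximation, \hat{v}(s) = \phi(s)^\top w.
% Notation is the conventional one, not necessarily the paper's.
\[
  w_{k+1} \;=\; w_k \;+\; \alpha \,
  \bigl( r + \gamma\, \phi(s')^\top w_k - \phi(s)^\top w_k \bigr)\, \phi(s)
\]
```

Here $(s, r, s')$ is one observed transition, and the bracketed term is the TD error.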
Abstract
In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding of how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with a linear Transformer. We prove that, with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer…
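The behaviour the abstract describes (repeated TD updates over a fixed context, with error that shrinks quickly and then flattens) can be illustrated with a small numerical sketch. This is not the paper's linear-Transformer construction: the 3-state chain, the one-hot features, the step size, and the helper names `sample_context`, `repeated_td`, and `demo` are all hypothetical choices made for illustration. Each loop iteration stands in for one CoT step only by analogy.

```python
import numpy as np


def sample_context(P, state_rewards, n, rng):
    """Roll out one trajectory and return n transitions (s, r, s')."""
    num_states = P.shape[0]
    states = np.empty(n, dtype=int)
    next_states = np.empty(n, dtype=int)
    s = 0
    for t in range(n):
        s_next = rng.choice(num_states, p=P[s])
        states[t], next_states[t] = s, s_next
        s = s_next
    return states, state_rewards[states], next_states


def repeated_td(phi, rewards, phi_next, gamma, alpha, num_steps):
    """Apply an averaged TD(0) update num_steps times on a fixed context.

    Returns an array of shape (num_steps + 1, d) holding every iterate,
    starting from w = 0.
    """
    n, d = phi.shape
    w = np.zeros(d)
    history = [w.copy()]
    for _ in range(num_steps):
        td_errors = rewards + gamma * (phi_next @ w) - phi @ w  # shape (n,)
        w = w + alpha * (phi.T @ td_errors) / n
        history.append(w.copy())
    return np.array(history)


def demo(n=2000, num_steps=200, gamma=0.5, alpha=1.0, seed=0):
    """Return the max-norm policy evaluation error after each update."""
    rng = np.random.default_rng(seed)
    # Assumed toy Markov reward process (not taken from the paper).
    P = np.array([[0.50, 0.50, 0.00],
                  [0.25, 0.50, 0.25],
                  [0.00, 0.50, 0.50]])
    state_rewards = np.array([1.0, 0.0, -1.0])

    states, rewards, next_states = sample_context(P, state_rewards, n, rng)
    eye = np.eye(3)
    phi, phi_next = eye[states], eye[next_states]  # one-hot features

    # Exact value function, used only to measure the error.
    true_v = np.linalg.solve(np.eye(3) - gamma * P, state_rewards)

    weights = repeated_td(phi, rewards, phi_next, gamma, alpha, num_steps)
    # With one-hot features the value estimate equals the weight vector.
    return np.max(np.abs(weights - true_v), axis=1)
```

Calling `demo()` returns one error value per update. In this toy setting the error drops rapidly over the first updates and then stops improving, because the iterates converge to the solution implied by the finite context rather than to the exact value function. Increasing `n` lowers that plateau, which mirrors the abstract's statement that the floor is determined by the context length.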
