Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
Zixuan Xie, Xinyu Liu, Claire Chen, Shuze Daniel Liu, Rohan Chandra, Shangtong Zhang

TL;DR
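This paper provides a theoretical analysis of softmax attention in Transformers for in-context reinforcement learning (ICRL). It shows that, with certain parameters, the layerwise forward pass is equivalent to iterative updates of a weighted softmax TD learning algorithm, and it analyzes how the policy evaluation error decays.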
Contribution
It offers the first theoretical understanding of softmax attention in ICRL, connecting it to a new RL algorithm and analyzing parameter effects on learning error.
Findings
With certain parameters, the layerwise forward pass of a softmax-attention Transformer is equivalent to iterative weighted softmax TD updates.
Policy evaluation error decreases with more layers under certain conditions.
The identified parameters minimize the pretraining loss, which explains their emergence.
Abstract
In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and admits both linear TD and tabular TD as special cases. We…
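The distinction the abstract draws between softmax attention and linear attention can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: a single attention head, no learned projection matrices, no 1/sqrt(d) scaling, and toy inputs chosen by hand. The function names and values are hypothetical and do not come from the paper, and the sketch does not implement the weighted softmax TD algorithm itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, use_softmax=True):
    """Single-head attention over lists of vectors (illustrative only).

    use_softmax=True  -> softmax attention: scores are normalized by softmax.
    use_softmax=False -> linear attention: the softmax is replaced by the
                         identity, so raw dot products are used as weights.
    """
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Dot-product score between the query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores) if use_softmax else scores
        # Weighted sum of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
    return outputs


# Hypothetical toy inputs: one query, three key/value pairs.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0], [2.0], [3.0]]

soft_out = attention(Q, K, V, use_softmax=True)
lin_out = attention(Q, K, V, use_softmax=False)
print("softmax attention:", soft_out)
print("linear attention: ", lin_out)
```

With softmax, the weights are nonnegative and sum to one, so each output is a convex combination of the value vectors. With the identity mapping, the weights are raw dot products and can be negative or unnormalized, which is one way to see why the two settings may call for separate analyses.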
