Rethinking Transformers in Solving POMDPs
Chenhao Lu, Ruizhe Shi, Yuyao Liu, Kaizhe Hu, Simon S. Du, and Huazhe Xu

TL;DR
This paper critically examines the limitations of Transformers in solving POMDPs, demonstrating their theoretical shortcomings and proposing a recurrent alternative, the Deep Linear Recurrent Unit, which outperforms Transformers in empirical tests.
Contribution
It reveals the theoretical limitations of Transformers in modeling POMDPs and introduces the Deep Linear Recurrent Unit as a more effective architecture for partially observable RL tasks.
Findings
Transformers struggle to model regular languages reducible to POMDPs.
Deep Linear Recurrent Units outperform Transformers in empirical evaluations.
Transformers lack the recurrence needed for effective POMDP learning.
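To make the first and third findings concrete, here is a minimal sketch of how a regular language can be posed as a partially observable task. Parity is used only because it is a familiar regular language; the page does not show the paper's actual reduction, and `ParityEnv`, `run_recurrent_agent`, and `run_memoryless_agent` are hypothetical names introduced for illustration.

```python
def parity_dfa_step(state: int, bit: int) -> int:
    """One transition of the two-state DFA for parity (state = parity so far)."""
    return state ^ bit


class ParityEnv:
    """Hypothetical episodic task (an assumption, not the paper's construction).

    The agent sees one bit per step. The DFA state is hidden. The action taken
    after the last bit earns reward 1.0 if it equals the parity of all bits.
    """

    def __init__(self, bits):
        self.bits = list(bits)  # must be non-empty
        self.t = 0
        self.hidden = 0

    def reset(self):
        self.t = 0
        self.hidden = parity_dfa_step(0, self.bits[0])
        return self.bits[0]

    def step(self, action):
        self.t += 1
        if self.t == len(self.bits):
            # Episode ends: score the action chosen after the final observation.
            return None, float(action == self.hidden), True
        self.hidden = parity_dfa_step(self.hidden, self.bits[self.t])
        return self.bits[self.t], 0.0, False


def run_recurrent_agent(env):
    """Agent with a one-bit recurrent memory that mirrors the hidden DFA state."""
    memory = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(memory)
        total += reward
        if not done:
            memory ^= obs  # recurrent update: new memory depends on old memory
    return total


def run_memoryless_agent(env):
    """Agent that acts on the latest observation only."""
    last_obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(last_obs)
        total += reward
        if not done:
            last_obs = obs
    return total
```

The recurrent agent solves every episode with a single bit of state, while the memoryless agent fails whenever the last bit differs from the overall parity, for example on the sequence `[1, 0, 1]`.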
Abstract
Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transformers, in Partially Observable Markov Decision Processes (POMDPs) and reveals its theoretical limitations. We establish that regular languages, which Transformers struggle to model, are reducible to POMDPs. This poses a significant challenge for Transformers in learning POMDP-specific inductive biases, because they lack the inherent recurrence found in models such as RNNs. This paper casts doubt on the prevalent belief in Transformers as sequence models for RL and proposes introducing a point-wise recurrent structure. The Deep Linear Recurrent Unit (LRU) emerges as a well-suited alternative for Partially Observable RL, with empirical results…
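The abstract names a point-wise recurrent structure but this page does not give its parameterization. As a rough sketch under that caveat, a generic element-wise (diagonal) linear recurrence with complex eigenvalues might look like the following. The function name `linear_recurrent_scan`, the scalar-input setup, and the parameters `lambdas`, `b`, and `c` are assumptions made for illustration, not the authors' implementation.

```python
def linear_recurrent_scan(xs, lambdas, b, c):
    """Run an element-wise linear recurrence over a scalar input sequence.

    Assumed form (illustrative only):
        h_t = lambdas * h_{t-1} + b * x_t    (element-wise, complex state)
        y_t = Re(sum_i c_i * h_{t,i})
    """
    h = [0j] * len(lambdas)
    ys = []
    for x in xs:
        # Each state channel is updated independently of the others.
        h = [lam * h_i + b_i * x for lam, h_i, b_i in zip(lambdas, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)).real)
    return ys


# Usage: one channel with decay 0.9 applied to an impulse input.
outputs = linear_recurrent_scan([1.0, 0.0, 0.0], lambdas=[0.9 + 0j], b=[1.0], c=[1.0])
```

Because the update has no nonlinearity between time steps, the state after step t is a fixed linear function of all earlier inputs, which is what lets such layers carry information forward across a whole episode with constant memory per step.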
Taxonomy
Topics: DNA and Biological Computing
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
