TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse
Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn

TL;DR
This paper introduces TQL, a method that stabilizes transformer-based value functions in reinforcement learning by preventing attention score collapse through entropy control, enabling effective scaling and significant performance improvements.
Contribution
The paper identifies attention score collapse as a key obstacle in scaling transformers for RL value functions and proposes entropy-based control to stabilize training and improve performance.
Findings
Up to 43% performance improvement with larger models
Attention scores collapse as model capacity increases
Entropy control stabilizes transformer training in RL
Abstract
Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions -- including with a transformer architecture, which is known to be highly scalable -- often results in learning instability and worse performance. In this work, we ask: what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our…
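To make "controlling the entropy of the attention scores" concrete, here is a minimal NumPy sketch of how attention entropy can be measured and penalized. It is not the paper's implementation: the abstract does not say how TQL controls entropy, so the function names, the `target_entropy` parameter, and the squared-gap penalty are all assumptions chosen for illustration. The sketch also assumes that "collapse" means each query's attention mass concentrating on very few keys, which shows up as low entropy.

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights; each row sums to 1."""
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (in nats) of each query's attention distribution."""
    return -(weights * np.log(weights + eps)).sum(axis=-1)

def entropy_penalty(weights, target_entropy):
    """Hypothetical regularizer (an assumption, not from the paper):
    squared gap between the mean attention entropy and a chosen target."""
    return (attention_entropy(weights).mean() - target_entropy) ** 2

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 queries, head dimension 8
k = rng.normal(size=(6, 8))   # 6 keys

soft = attention_weights(q, k)          # moderate logits
sharp = attention_weights(q * 50.0, k)  # inflated logits: near one-hot rows

max_entropy = np.log(k.shape[0])        # entropy of uniform attention over 6 keys
print("mean entropy (soft): ", attention_entropy(soft).mean())
print("mean entropy (sharp):", attention_entropy(sharp).mean())
print("penalty (sharp):     ", entropy_penalty(sharp, 0.5 * max_entropy))
```

Inflating the query scale stands in for what can happen as logits grow during training: entropy drops toward zero and the penalty rises. In a real training loop a term like this would be added to the Q-learning loss with some weight, though the weighting and the target are design choices the abstract leaves open.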
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
