Untangling tradeoffs between recurrence and self-attention in neural networks
Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette, Yoshua Bengio, Guillaume Lajoie

TL;DR
This paper provides a formal analysis of self-attention in recurrent networks, showing how it mitigates vanishing gradients and proposing a scalable sparse attention mechanism to balance performance and resource use.
Contribution
It offers a theoretical understanding of attention's role in gradient propagation and introduces a novel relevancy screening mechanism for scalable self-attention in recurrent models.
Findings
Self-attention mitigates vanishing gradients in recurrent networks.
A relevancy screening mechanism is proposed to make sparse self-attention scalable.
Tradeoffs between attention, recurrence, and computational resources are demonstrated.
Abstract
Attention and self-attention mechanisms are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and relies on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies by establishing concrete bounds for gradient norms. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical…
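Illustrative sketch (not from the paper)
The abstract does not spell out how relevancy screening works, so the following is only a rough NumPy sketch of the general idea under stated assumptions: a toy tanh recurrent cell attends over a bounded memory of its own past hidden states, and old states are kept only if they have accumulated high attention weight. The function name run_screened_attention_rnn, the parameters short_term and long_term, the dot-product attention, and the "sum of received attention weights" scoring rule are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def run_screened_attention_rnn(inputs, hidden_size=8, short_term=3,
                               long_term=2, seed=0):
    """Toy tanh RNN that attends over a screened memory of past states.

    Assumptions (hypothetical, for illustration only):
    - weights are random and untrained;
    - attention is scaled dot-product between the current state and memory;
    - a state's relevancy score is the total attention weight it received.
    Returns the final state, the final memory, and the peak memory size.
    """
    if short_term < 1 or long_term < 0:
        raise ValueError("short_term must be >= 1 and long_term >= 0")
    rng = np.random.default_rng(seed)
    input_size = inputs.shape[1]
    W_in = rng.normal(scale=0.3, size=(hidden_size, input_size))
    W_rec = rng.normal(scale=0.3, size=(hidden_size, hidden_size))
    W_ctx = rng.normal(scale=0.3, size=(hidden_size, hidden_size))

    h = np.zeros(hidden_size)
    memory = []  # entries: {"t": step index, "h": state, "score": relevancy}
    peak = 0
    for t, x in enumerate(inputs):
        if memory:
            states = np.stack([m["h"] for m in memory])
            weights = softmax(states @ h / np.sqrt(hidden_size))
            context = weights @ states
            # Accumulate the attention each stored state received.
            for m, w in zip(memory, weights):
                m["score"] += float(w)
        else:
            context = np.zeros(hidden_size)

        # Recurrent update with a direct path from attended past states.
        h = np.tanh(W_in @ x + W_rec @ h + W_ctx @ context)
        memory.append({"t": t, "h": h.copy(), "score": 0.0})

        # Screening: always keep the most recent states, and among the
        # older ones keep only the highest-scoring few.
        recent = memory[-short_term:]
        older = memory[:-short_term]
        kept = sorted(older, key=lambda m: m["score"], reverse=True)[:long_term]
        memory = sorted(kept, key=lambda m: m["t"]) + recent
        peak = max(peak, len(memory))
    return h, memory, peak
```

With this rule the memory never grows beyond short_term + long_term entries, so the cost of each attention step stays constant in sequence length, while the retained older states still give the current state a short path to distant time steps. That is the tradeoff between attention, recurrence, and resources that the summary describes, shown here only in caricature.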
Taxonomy
Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Reinforcement Learning in Robotics
