Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink
Victoria Hankemeier, Malte Schilling

TL;DR
This paper investigates the bias in temporal attention mechanisms caused by over-squashing, introduces sensitivity bounds, and proposes regularization methods to mitigate the diagonal sink effect in spatio-temporal models.
Contribution
The paper provides a theoretical analysis of diagonal attention sink in temporal attention and introduces regularization techniques to address it.
Findings
Temporal attention matrices suffer from a diagonal attention sink.
Regularization methods improve temporal attention performance.
Sensitivity bounds relate off-diagonal scores to sequence length.
Abstract
Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration across space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias toward the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer from a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.
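To make the terms in the abstract concrete, here is a minimal toy sketch of what a diagonal attention sink looks like numerically. It is not the paper's derivation or its regularizer. The score matrix, the function names, and the `diagonal_penalty` term are all hypothetical and chosen only for illustration. The assumption is that every time step scores itself a fixed amount higher than every other step, so the softmax weight left for each off-diagonal entry shrinks as the sequence length T grows.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax over a square list-of-lists score matrix."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out


def self_similarity_scores(T, self_score=2.0, other_score=0.0):
    """Toy scores (assumption): each step scores itself higher than others."""
    return [[self_score if i == j else other_score for j in range(T)]
            for i in range(T)]


def diagonal_mass(attn):
    """Mean attention weight that each time step places on itself."""
    T = len(attn)
    return sum(attn[i][i] for i in range(T)) / T


def mean_off_diagonal(attn):
    """Mean attention weight on a single off-diagonal entry."""
    T = len(attn)
    total = sum(attn[i][j] for i in range(T) for j in range(T) if i != j)
    return total / (T * (T - 1))


def diagonal_penalty(attn, weight=0.1):
    """Hypothetical regularization term: penalize mass on the diagonal.

    Illustrative only; the paper's actual regularizers are not given here.
    """
    return weight * diagonal_mass(attn)


if __name__ == "__main__":
    for T in (4, 16, 64):
        attn = softmax_rows(self_similarity_scores(T))
        print(T, diagonal_mass(attn), mean_off_diagonal(attn),
              diagonal_penalty(attn))
```

In this toy setting the diagonal entry of each row is always larger than any single off-diagonal entry, and the per-entry off-diagonal weight decays as T increases. That is the qualitative behavior the abstract describes, shown here under much stronger assumptions than the paper's sensitivity bounds.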
Taxonomy
Topics: Neuroscience and Music Perception · Neural dynamics and brain function · Neural and Behavioral Psychology Studies
