Attention (as Discrete-Time Markov) Chains
Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Amit H. Bermano

TL;DR
This paper presents a novel interpretation of attention mechanisms in transformers as discrete-time Markov chains, enabling new insights, improved segmentation, and enhanced image generation through the analysis of metastable states and TokenRank.
Contribution
It introduces a Markov chain perspective on attention, allowing analysis of token importance and attention dynamics, leading to state-of-the-art zero-shot segmentation and improved image generation.
Findings
Metastable states correspond to semantically similar regions.
TokenRank improves image generation quality and diversity.
The framework enhances segmentation performance on benchmarks.
Abstract
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our key observation is that tokens linked to semantically similar regions form metastable states, i.e., regions where attention tends to concentrate, while noisy attention scores dissipate. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank -- the steady state vector of the Markov chain, which measures…
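The steady-state idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a softmax-normalized attention matrix, where each row sums to 1, is read as the transition matrix of a Markov chain, and it uses plain power iteration to approximate the steady-state vector. The function names (softmax_rows, step, steady_state) and the toy scores are hypothetical and chosen only for illustration.

```python
import math


def softmax_rows(scores):
    """Normalize each row with softmax so it sums to 1 (row-stochastic)."""
    rows = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def step(dist, P):
    """One Markov chain step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def steady_state(P, tol=1e-12, max_iter=10_000):
    """Approximate the steady-state vector by power iteration.

    Assumes P is row-stochastic with strictly positive entries
    (true for softmax output), so a unique steady state exists.
    """
    n = len(P)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = step(dist, P)
        if sum(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist


# Toy pre-softmax attention scores for three tokens (illustrative values only).
scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [1.0, 1.0, 0.0],
]
P = softmax_rows(scores)
rank = steady_state(P)  # one score per token; the entries sum to 1
```

Calling step repeatedly from a one-hot start vector gives the multi-step ("indirect") attention that originates from a single token, while rank approximates the vector left unchanged by a further step.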
Taxonomy
Topics: Neural Networks and Applications
