Transformers from an Optimization Perspective
Yongyi Yang, Zengfeng Huang, David Wipf

TL;DR
This paper investigates whether Transformers can be viewed as an optimization process by identifying an underlying energy function, providing a new interpretability perspective for these complex models.
Contribution
It introduces techniques that associate Transformer layers with descent steps on an energy function, extending the unfolded-optimization view previously applied to MLPs and CNNs to models with self-attention.
Findings
Established a connection between energy minimization and Transformer layers
Demonstrated the feasibility of interpreting Transformers as unfolding optimization processes
Provided foundational techniques to analyze complex deep models through an energy function lens
Abstract
Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can view Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms, such as the Transformer, has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them,…
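The unfolding idea mentioned in the abstract can be illustrated on a much simpler model than the Transformer. The sketch below is not the paper's construction; it is a toy example, assuming a quadratic energy E(y) = 0.5 yᵀAy - bᵀy with a nonnegativity constraint, in which one projected gradient step has exactly the form of a ReLU layer with shared weights. All names (energy, unfolded_layer, run_unfolded_network, A, b) are hypothetical and chosen for illustration only.

```python
import numpy as np


def energy(y, A, b):
    """Toy quadratic energy E(y) = 0.5 * y^T A y - b^T y (an assumption, not from the paper)."""
    return 0.5 * y @ A @ y - b @ y


def unfolded_layer(y, W, c):
    """One projected gradient step on the toy energy, written as a ReLU layer."""
    return np.maximum(W @ y + c, 0.0)


def run_unfolded_network(A, b, num_layers=20):
    # Step size 1/L, where L is the largest eigenvalue of A; with this choice
    # projected gradient descent does not increase the energy.
    step = 1.0 / np.linalg.eigvalsh(A).max()
    # y - step * (A y - b) = (I - step * A) y + step * b = W y + c
    W = np.eye(len(b)) - step * A
    c = step * b
    y = np.zeros_like(b)
    energies = [energy(y, A, b)]
    for _ in range(num_layers):
        y = unfolded_layer(y, W, c)  # each "layer" is one descent step
        energies.append(energy(y, A, b))
    return y, energies


rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)  # symmetric positive definite
b = rng.standard_normal(4)
y, energies = run_unfolded_network(A, b)
print("energy per layer:", np.round(energies, 4))
```

In this toy setting the forward pass through the stacked layers is the same computation as running the optimizer, so the energy recorded after each layer should never increase. The question posed in the abstract is whether a comparable energy can be found when the layers involve self-attention.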
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Machine Learning and Data Classification
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Multi-Head Attention
