Transformers from an Optimization Perspective

Yongyi Yang; Zengfeng Huang; David Wipf

arXiv:2205.13891 · cs.LG · February 28, 2023 · 6 cites

TL;DR

This paper investigates whether the Transformer forward pass can be viewed as an optimization process, namely a sequence of descent steps on an underlying energy function, which would offer a new way to interpret these models.

Contribution

The paper outlines the main obstacles to associating Transformer layers with energy function minimization and introduces companion techniques that at least partially address them, extending the optimization-unfolding view to models with self-attention.

Findings

01. Established a connection between energy minimization and Transformer layers

02. Demonstrated the feasibility of interpreting Transformers as unfolded optimization processes

03. Provided foundational techniques for analyzing complex deep models through an energy function lens

Abstract

Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can view Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, it has thus far remained elusive obtaining a similar equivalence for more complex models with self-attention mechanisms like the Transformer. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them,…
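The unfolding idea mentioned in the abstract can be illustrated on a much simpler model than the Transformer. The sketch below is not the energy function from the paper; it assumes a hypothetical quadratic energy E(y) = 0.5 yᵀAy − bᵀy with a nonnegativity constraint, for which one projected gradient step has the form ReLU(Wy + c) with W = I − step·A and c = step·b, i.e. an MLP-style layer. The names `energy`, `layer`, and `unfold`, and the values of `A`, `b`, and `step`, are all illustrative assumptions.

```python
# Illustrative sketch only: a hypothetical quadratic energy, NOT the energy
# function derived in the paper. It shows how a stack of ReLU layers can be
# read as projected gradient descent on an energy.

def energy(y, A, b):
    """E(y) = 0.5 * y^T A y - b^T y."""
    n = len(y)
    quad = sum(y[i] * A[i][j] * y[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * y[i] for i in range(n))
    return 0.5 * quad - lin

def layer(y, A, b, step):
    """One 'layer': gradient step on E, then projection onto y >= 0 (a ReLU)."""
    n = len(y)
    grad = [sum(A[i][j] * y[j] for j in range(n)) - b[i] for i in range(n)]
    return [max(0.0, y[i] - step * grad[i]) for i in range(n)]

def unfold(y0, A, b, step, num_layers):
    """Apply the same layer repeatedly and return every intermediate state."""
    states = [y0]
    for _ in range(num_layers):
        states.append(layer(states[-1], A, b, step))
    return states

A = [[2.0, 0.5], [0.5, 1.0]]  # symmetric positive definite (assumed)
b = [1.0, -1.0]
step = 0.4                    # below 1 / (largest eigenvalue of A)

trajectory = unfold([1.0, 1.0], A, b, step, num_layers=10)
energies = [energy(y, A, b) for y in trajectory]
```

With this step size, each layer does not increase the energy, so the forward pass through the stack can be read as iterations of an optimizer. The question posed in the abstract is whether an analogous energy exists for layers that include self-attention.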

Code & Models

Repositories

fftyyy/transformers-from-optimization (PyTorch, official)

Videos

Transformers from an Optimization Perspective · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Machine Learning and Data Classification

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Multi-Head Attention