Orthogonal Self-Attention
Leo Zhang, James Martens

TL;DR
This paper introduces Orthogonal Self-Attention (OSA), an attention mechanism that constrains the attention matrix to be orthogonal, with the aim of avoiding rank collapse and poorly-conditioned Jacobians and making Transformers easier to train, especially skipless architectures.
Contribution
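The paper proposes OSA, which parametrises the attention matrix to be orthogonal by passing a skew-symmetric matrix, formed from query-key values, through the matrix exponential. The goal is stable training of (non-causal) Transformers without skip connections or normalisation layers.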
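The parametrisation described above can be illustrated with a minimal NumPy sketch. It assumes the skew-symmetric matrix is formed as S = Q K^T - K Q^T, which is one natural choice and not necessarily the paper's exact construction. The function names (`skew_from_qk`, `expm_skew`, `orthogonal_attention`), the weight matrices, and the sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np

def skew_from_qk(Q, K):
    """Build a skew-symmetric matrix from queries and keys.

    Assumption: S = Q K^T - K Q^T, so that S^T = -S by construction.
    """
    return Q @ K.T - K @ Q.T

def expm_skew(S):
    """Matrix exponential of a real skew-symmetric matrix.

    1j * S is Hermitian, so eigh gives 1j * S = V diag(w) V^H with real w.
    Hence exp(S) = V diag(exp(-1j * w)) V^H, which is real and orthogonal.
    """
    w, V = np.linalg.eigh(1j * S)
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

def orthogonal_attention(X, Wq, Wk, Wv):
    """Apply an orthogonal attention matrix to the values (dense, O(n^2) memory)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = skew_from_qk(Q, K)
    A = expm_skew(S)          # (n, n) orthogonal attention matrix
    return A @ V, A, S, V

# Illustrative sizes and random weights (hypothetical).
rng = np.random.default_rng(0)
n, d_model, d_head = 6, 8, 4
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
              for _ in range(3))

out, A, S, V = orthogonal_attention(X, Wq, Wk, Wv)
print("max |A A^T - I| =", np.abs(A @ A.T - np.eye(n)).max())
```

Because A is orthogonal, multiplying by it preserves the Frobenius norm of the values. This is a simple way to see why such a layer cannot collapse the rank of its input the way a row-stochastic softmax matrix can.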
Findings
OSA can be implemented efficiently, with linear complexity, by exploiting the low-rank structure of its query-key values.
The proposed initialization ensures well-conditioned Jacobians.
OSA improves training stability in skipless Transformer architectures.
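The efficiency finding can be made concrete with a sketch of one standard low-rank identity. It is offered as an illustration, not as the paper's algorithm. It assumes S = Q K^T - K Q^T, which factors as S = B J B^T with B = [Q, K] of shape (n, 2d) and a fixed 2d x 2d block matrix J. Then exp(S) V = V + B phi1(M) J B^T V, where M = J B^T B and phi1(M) = sum_k M^k / (k+1)!. Only 2d x 2d matrices are exponentiated, so for a fixed head dimension the cost grows linearly in the number of tokens n. The helper names `expm_taylor` and `osa_apply_lowrank` are hypothetical.

```python
import numpy as np

def expm_taylor(M, terms=18):
    """Dense matrix exponential via scaling and squaring with a Taylor series.

    Intended for small or moderately sized matrices; not a tuned routine.
    """
    norm = np.linalg.norm(M, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A = M / (2.0 ** s)                      # scaled so that ||A||_1 <= 0.5
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k                 # A^k / k!
        E = E + term
    for _ in range(s):                      # undo the scaling by squaring
        E = E @ E
    return E

def osa_apply_lowrank(Q, K, V):
    """Compute exp(Q K^T - K Q^T) @ V without forming any n x n matrix."""
    n, d = Q.shape
    B = np.concatenate([Q, K], axis=1)                     # (n, 2d)
    J = np.block([[np.zeros((d, d)), np.eye(d)],
                  [-np.eye(d), np.zeros((d, d))]])         # S = B J B^T
    M = J @ (B.T @ B)                                      # (2d, 2d)
    # exp([[M, I], [0, 0]]) carries phi1(M) in its top-right block.
    aug = np.zeros((4 * d, 4 * d))
    aug[:2 * d, :2 * d] = M
    aug[:2 * d, 2 * d:] = np.eye(2 * d)
    phi1 = expm_taylor(aug)[:2 * d, 2 * d:]
    return V + B @ (phi1 @ (J @ (B.T @ V)))

# Compare against the dense computation on illustrative random inputs.
rng = np.random.default_rng(0)
n, d = 32, 4
Q = 0.3 * rng.standard_normal((n, d))
K = 0.3 * rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

A_dense = expm_taylor(Q @ K.T - K @ Q.T)    # (n, n), for reference only
out_dense = A_dense @ V
out_fast = osa_apply_lowrank(Q, K, V)
print("max difference:", np.abs(out_fast - out_dense).max())
```

The dense path is included only to check the identity. In the low-rank path the largest intermediate arrays have shape (n, 2d) and (4d, 4d).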
Abstract
Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA due to inducing rank collapse and poorly-conditioned Jacobians. In this work, we design a novel attention mechanism: Orthogonal Self-Attention (OSA), which aims to bypass these issues with SSA, in order to allow for (non-causal) Transformers without skip connections and normalisation layers to be more easily trained. In particular, OSA parametrises the attention matrix to be orthogonal via mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. We show that this can be practically implemented, by exploiting the low-rank structure of our query-key values, resulting in the computational complexity and memory cost of…
Taxonomy
Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques
