Attention-Only Transformers via Unrolled Subspace Denoising
Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma

TL;DR
This paper introduces a simplified, interpretable transformer architecture derived by unrolling subspace denoising: each layer refines token representations through self-attention alone, and the resulting model performs close to GPT-2 and CRATE on vision and language tasks.
Contribution
The authors propose a transformer design obtained by unrolling iterative denoising operations into layers that contain only self-attention and skip connections, which gives the architecture a mathematical justification and makes it interpretable.
Findings
Achieves performance close to GPT-2 and CRATE on vision and language tasks.
Each layer denoises token representations at a linear rate.
Simplifies transformer architecture to only self-attention with skip connections.
Abstract
Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each…
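As a rough illustration of the architecture the abstract describes, the following NumPy sketch stacks layers that each apply a multi-head subspace self-attention update plus a skip connection. It is a minimal sketch under assumptions, not the authors' implementation: the function names, the step size `step`, the softmax normalization, and the fixed orthonormal bases passed as `bases` are all hypothetical choices made here for illustration, and the paper's exact operator may differ.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def subspace_attention_layer(Z, bases, step=0.5):
    """One layer: skip connection plus a sum of per-subspace attention heads.

    Z     : (d, n) array, one column per token.
    bases : list of (d, p) arrays with orthonormal columns, one per head
            (hypothetical: fixed here, not learned).
    step  : hypothetical step size scaling the attention update.
    """
    update = np.zeros_like(Z)
    for U in bases:
        V = U.T @ Z                      # (p, n): tokens projected onto this subspace
        A = softmax(V.T @ V, axis=0)     # (n, n): each column sums to 1
        update += U @ (V @ A)            # mix projected tokens, lift back to R^d
    return Z + step * update             # skip connection


def unrolled_denoiser(Z, bases, num_layers=4, step=0.5):
    """Unroll the same attention-only update for num_layers layers."""
    for _ in range(num_layers):
        Z = subspace_attention_layer(Z, bases, step)
    return Z
```

Calling `unrolled_denoiser(Z, bases)` on a `(d, n)` array of token representations returns an array of the same shape. Sharing one fixed set of bases across layers is a simplification of this sketch; a trained model would presumably learn them.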
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
