Attention-Only Transformers via Unrolled Subspace Denoising

Peng Wang; Yifu Lu; Yaodong Yu; Druv Pai; Qing Qu; Yi Ma

arXiv:2506.03790·cs.LG·June 5, 2025

TL;DR

This paper introduces a simplified, interpretable transformer architecture based on unrolled subspace denoising, which achieves competitive performance on vision and language tasks by iteratively refining token representations through self-attention.

Contribution

The authors propose a new transformer design that uses unrolled denoising operations with only self-attention and skip connections, providing mathematical justification and interpretability.

Findings

1. Achieves performance close to GPT-2 and CRATE on vision and language tasks.

2. Each layer denoises token representations at a linear rate.

3. Simplifies transformer architecture to only self-attention with skip connections.

Abstract

Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each…
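To make the architecture described in the abstract concrete, here is a minimal NumPy sketch of a network built only from multi-head subspace self-attention and skip connections. It is an illustration under stated assumptions, not the authors' implementation: the update form (project tokens onto each head's basis, attend within that subspace, lift back, add to the input), the step size `step_size`, the column-wise softmax, and all function names are assumptions that do not appear in the text above. Each head `U` is assumed to be a d x p matrix with orthonormal columns spanning one low-dimensional subspace.

```python
import numpy as np


def softmax_columns(scores):
    """Column-wise softmax, shifted by the column max for numerical stability."""
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)


def subspace_attention_layer(Z, bases, step_size=0.1):
    """One attention-only layer with a skip connection (assumed form).

    Z     : (d, n) array, one token representation per column.
    bases : list of (d, p) arrays, one assumed subspace basis per head.
    """
    update = np.zeros_like(Z)
    for U in bases:
        projected = U.T @ Z                              # (p, n) token coordinates in the subspace
        attn = softmax_columns(projected.T @ projected)  # (n, n) token-to-token weights
        update += U @ (projected @ attn)                 # mix tokens, lift back to d dimensions
    return Z + step_size * update                        # skip connection


def unrolled_network(Z, layers, step_size=0.1):
    """Apply the layers in sequence; each layer is one unrolled iteration."""
    for bases in layers:
        Z = subspace_attention_layer(Z, bases, step_size)
    return Z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, p, n, num_heads, depth = 8, 2, 5, 3, 4
    # Hypothetical random orthonormal bases; in a trained model these would be learned.
    layers = [
        [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(num_heads)]
        for _ in range(depth)
    ]
    Z0 = rng.standard_normal((d, n))
    ZL = unrolled_network(Z0, layers)
    print(ZL.shape)
```

The sketch has no MLP block and no normalization layer, which is the structural point the abstract makes. With random bases it shows only the data flow; it makes no claim about denoising quality or about the reported results.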

Videos

Attention-Only Transformers via Unrolled Subspace Denoising (SlidesLive)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications