Axial Attention in Multidimensional Transformers
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans

TL;DR
The paper introduces Axial Transformers, a novel self-attention-based model for high-dimensional data that achieves state-of-the-art results while maintaining computational efficiency and ease of implementation.
Contribution
It presents axial attention, a new self-attention mechanism that scales efficiently to high-dimensional data and enables state-of-the-art generative modeling performance.
Findings
Achieved state-of-the-art results on ImageNet-32 and ImageNet-64 benchmarks.
Demonstrated effectiveness on the BAIR Robotic Pushing video benchmark.
Maintained full distribution expressiveness with a semi-parallel decoding structure.
Abstract
We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks. Our models are based on axial attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings.…
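As an illustration of the axial attention idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It applies ordinary single-head, unmasked self-attention along one axis of an (H, W, D) tensor while treating the other axis as a batch dimension. The function name `axial_attention`, the random projection matrices, and the tensor sizes are assumptions made for the example. The autoregressive masking and multi-head structure used in the paper are omitted. For a square image with N = S x S pixels, each position attends to S positions per axis instead of all N, so the attention cost is on the order of N·sqrt(N) rather than N^2.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, wq, wk, wv, axis):
    """Single-head, unmasked self-attention along one axis of x.

    x has shape (H, W, D). axis=1 attends within each row (along width);
    axis=0 attends within each column (along height).
    wq, wk, wv are (D, D) projection matrices (hypothetical, for illustration).
    """
    # Move the attended axis to position -2 so the remaining axis acts as batch.
    x_moved = np.moveaxis(x, axis, -2)                      # (batch, L, D)
    q, k, v = x_moved @ wq, x_moved @ wk, x_moved @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])  # (batch, L, L)
    out = softmax(scores, axis=-1) @ v                      # (batch, L, D)
    # Restore the original axis order.
    return np.moveaxis(out, -2, axis)

# Illustrative sizes and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
H, W, D = 4, 5, 8
x = rng.normal(size=(H, W, D))
wq, wk, wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

row_out = axial_attention(x, wq, wk, wv, axis=1)         # mixes within rows
full_out = axial_attention(row_out, wq, wk, wv, axis=0)  # then within columns
print(full_out.shape)  # (4, 5, 8)
```

Stacking a row layer and a column layer, as in the last two calls, lets every output position depend on every input position, even though each individual layer only mixes information along a single axis.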
Videos
Axial Attention & MetNet: A Neural Weather Model for Precipitation Forecasting · YouTube
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Medical Image Segmentation Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Axial Attention · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam
