Combiner: Full Attention Transformer with Sparse Computation Cost

Hongyu Ren; Hanjun Dai; Zihang Dai; Mengjiao Yang; Jure Leskovec; Dale Schuurmans; Bo Dai

arXiv:2107.05768 · cs.LG · October 29, 2021 · 28 cites

TL;DR

Combiner is a transformer attention mechanism that keeps full attention capability in every head while reducing compute and memory cost from quadratic to sub-quadratic in the sequence length, which makes very long sequences practical to process.

Contribution

It proposes a structured factorization of the attention distribution that provides full attention at low compute and memory cost, outperforming existing sparse methods in expressiveness and efficiency.
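To make the idea of a structured factorization concrete, the following is a minimal sketch of one possible two-level scheme, not the paper's exact parameterization. It treats the attention weights of a query as a probability distribution over positions: positions in the query's own block receive direct attention, while every other block is reached through a single summary key, and the probability assigned to that block is then spread over its members. The block size, the mean-pooled summary keys, the within-block weights, and all function names are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_key(keys, start, block):
    """Summary of one block: the mean of its keys (an illustrative choice)."""
    members = keys[start:start + block]
    return [sum(col) / len(members) for col in zip(*members)]


def factorized_row(i, queries, keys, block):
    """Return (p, n_scores): the effective distribution p(j | i) over ALL
    positions j, and the number of query-dependent scores computed for it.

    Summaries and within-block weights do not depend on the query, so a real
    implementation would compute them once and share them across queries.
    """
    n = len(keys)
    own = (i // block) * block
    local = list(range(own, min(own + block, n)))
    remote = [s for s in range(0, n, block) if s != own]
    summaries = {s: mean_key(keys, s, block) for s in remote}

    # One softmax over direct scores (own block) and summary scores (other blocks).
    scores = [dot(queries[i], keys[j]) for j in local]
    scores += [dot(queries[i], summaries[s]) for s in remote]
    probs = softmax(scores)

    row = [0.0] * n
    for j, p in zip(local, probs):
        row[j] = p  # direct attention
    for s, p_block in zip(remote, probs[len(local):]):
        members = list(range(s, min(s + block, n)))
        within = softmax([dot(summaries[s], keys[j]) for j in members])
        for j, w in zip(members, within):
            row[j] = p_block * w  # indirect attention through the block summary
    return row, len(scores)


if __name__ == "__main__":
    n, block = 16, 4
    keys = [[math.sin(j), math.cos(j)] for j in range(n)]
    queries = [[math.cos(2 * j), math.sin(2 * j)] for j in range(n)]
    row, n_scores = factorized_row(5, queries, keys, block)
    print(sum(row), min(row) > 0.0, n_scores)
```

In this sketch every position receives nonzero probability, so the query still attends to the whole sequence, yet it computes only `block + n / block - 1` query-dependent scores instead of `n`. Choosing a block size near the square root of the sequence length keeps the per-query work well below linear, which is what makes the total cost sub-quadratic.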

Findings

01 Achieves state-of-the-art results on several image and text modeling tasks.

02 Maintains full attention capability at sub-quadratic cost.

03 Works as a drop-in replacement for the attention layers in existing transformers.

Abstract

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $O(L^2)$ with respect to the sequence length in attention layers, which restricts application in extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention, or…
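The reading of self-attention as a conditional expectation can be checked with a few lines of code. The sketch below uses standard scaled dot-product attention, written so that the softmax weights appear explicitly as a distribution p(j | i) and the output is the expectation of the value vectors under it. The function names and the toy inputs are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_as_expectation(queries, keys, values):
    """For each query i, return E_{p(j|i)}[v_j] with
    p(j | i) = softmax_j(q_i . k_j / sqrt(d))."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # One score per key: this loop is the O(L) work per query,
        # hence O(L^2) over all L queries.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        probs = softmax(scores)  # the conditional distribution p(j | i)
        out = [sum(p * v[c] for p, v in zip(probs, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    # Identical keys give a uniform p(j | i), so the output is the plain mean of the values.
    keys = [[1.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 0.0], [3.0, 2.0]]
    print(attention_as_expectation([[0.5, -1.0]], keys, values))
```

With identical keys the distribution is uniform and the output is the average of the value vectors. A query that matches one key much more strongly than the rest concentrates the distribution on that position, and the output approaches that position's value vector.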

Code & Models

Videos

Combiner: Full Attention Transformer with Sparse Computation Cost · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks