Treeformer: Dense Gradient Trees for Efficient Attention Computation

Lovish Madaan; Srinadh Bhojanapalli; Himanshu Jain; Prateek Jain

arXiv:2208.09015·cs.CL·March 20, 2023

TL;DR

Treeformer introduces a hierarchical decision tree approach to reduce attention computation complexity in transformers, achieving near-logarithmic retrieval costs and significant efficiency gains while maintaining high accuracy on NLP tasks.

Contribution

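The paper presents Treeformer, a transformer architecture that uses decision-tree-based hierarchical navigation to compute attention efficiently. It can use either of two attention layers, TF-Attention or TC-Attention, and is trained with a bootstrapped method. The reported outcome is a much lower attention cost at accuracy comparable to baseline transformers.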

Findings

01. Achieves a 30x reduction in FLOPs for the attention layer.

02. Maintains accuracy comparable to baseline transformers.

03. Outperforms Linformer by up to 12% in accuracy at similar FLOPs.

Abstract

Standard inference and training with transformer-based architectures scale quadratically with input sequence length. This is prohibitively expensive for a variety of applications, especially web-page translation, query answering, etc. Consequently, several approaches have been developed recently to speed up attention computation by enforcing different attention structures, such as sparsity, low rank, and kernel-based approximation of attention. In this work, we view attention computation as nearest neighbor retrieval, and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers: TF-Attention and TC-Attention. TF-Attention computes the attention in a fine-grained style, while TC-Attention…
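The retrieval view in the abstract can be illustrated with a small sketch. The code below is not the paper's TF-Attention or TC-Attention, and it is not the authors' implementation. It assumes a complete binary tree whose internal nodes hold random, untrained hyperplanes (the names `make_tree`, `route`, and `tree_routed_attention` are hypothetical), whereas the paper learns its trees. It only shows the cost structure: each token is routed to a leaf with `depth` dot products, and a query then attends only to the keys that share its leaf instead of to the whole sequence.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def make_tree(depth, dim, rng):
    """Hypothetical tree: one random hyperplane per internal node,
    stored in heap order (children of node i are 2i+1 and 2i+2)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(2 ** depth - 1)]


def route(tree, depth, x):
    """Walk root to leaf using `depth` dot products.
    Returns a leaf index in [0, 2**depth)."""
    node = 0
    for _ in range(depth):
        go_right = dot(tree[node], x) > 0.0
        node = 2 * node + 1 + int(go_right)
    return node - (2 ** depth - 1)


def tree_routed_attention(queries, keys, values, tree, depth):
    """Softmax attention restricted to keys in the query's leaf."""
    # Bucket key positions by the leaf they are routed to.
    buckets = {}
    for j, k in enumerate(keys):
        buckets.setdefault(route(tree, depth, k), []).append(j)

    out_dim = len(values[0])
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        idx = buckets.get(route(tree, depth, q), [])
        if not idx:
            # Assumption of this sketch: an empty leaf yields a zero vector.
            outputs.append([0.0] * out_dim)
            continue
        scores = [dot(q, keys[j]) * scale for j in idx]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, idx)) / total
            for d in range(out_dim)
        ])
    return outputs
```

With `depth = 0` the tree is empty, every token falls into a single leaf, and the function reduces to ordinary full softmax attention. Larger depths shrink the set of keys each query scores, which is where the savings would come from if leaves stay roughly balanced.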

Videos

Treeformer: Dense Gradient Trees for Efficient Attention Computation · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Softmax · Multi-Head Attention · Adam · Linear Layer · Multi-Head Linear Attention