Treeformer: Dense Gradient Trees for Efficient Attention Computation
Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain

TL;DR
Treeformer introduces a hierarchical decision tree approach to reduce attention computation complexity in transformers, achieving near-logarithmic retrieval costs and significant efficiency gains while maintaining high accuracy on NLP tasks.
Contribution
The paper presents Treeformer, a transformer architecture that uses decision trees for efficient attention. It can use one of two attention layer variants, TF-Attention or TC-Attention, and is trained with a bootstrapped method, reducing attention cost while keeping accuracy close to the baseline.
Findings
Achieves a 30x reduction in FLOPs for the attention layer.
Maintains accuracy comparable to baseline transformers.
Outperforms Linformer by up to 12% in accuracy at similar FLOPs.
Abstract
Standard inference and training with transformer-based architectures scale quadratically with input sequence length. This is prohibitively expensive for a variety of applications, especially web-page translation, query answering, etc. Consequently, several approaches have been developed recently to speed up attention computation by enforcing different attention structures such as sparsity or low rank, or by approximating attention using kernels. In this work, we view attention computation as that of nearest neighbor retrieval, and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer which can use one of two efficient attention layers -- TF-Attention and TC-Attention. TF-Attention computes the attention in a fine-grained style, while TC-Attention…
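The retrieval view in the abstract can be illustrated with a minimal NumPy sketch. It assumes a complete binary tree of hyperplane tests stored in heap order, and it restricts each query to the keys that land in the same leaf. The function names (`route_to_leaf`, `leaf_attention`), the fixed random hyperplanes, and the zero output for leaves without keys are all assumptions made for illustration; the paper learns its trees, and this is not the authors' implementation.

```python
import numpy as np

def route_to_leaf(x, weights, biases, depth):
    """Route each row of x through a complete binary tree of hyperplane tests.

    weights has shape (2**depth - 1, d) and biases has shape (2**depth - 1,),
    both in heap order (root at index 0). Returns a leaf id in [0, 2**depth).
    """
    node = np.zeros(x.shape[0], dtype=np.int64)  # every row starts at the root
    for _ in range(depth):
        # One dot product per level, so routing costs O(depth * d) per token.
        score = np.einsum("nd,nd->n", x, weights[node]) + biases[node]
        go_right = (score > 0).astype(np.int64)
        node = 2 * node + 1 + go_right  # heap-order child index
    return node - (2 ** depth - 1)  # convert node index to leaf id

def leaf_attention(Q, K, V, weights, biases, depth):
    """Softmax attention where each query only sees keys in its own leaf."""
    q_leaf = route_to_leaf(Q, weights, biases, depth)
    k_leaf = route_to_leaf(K, weights, biases, depth)
    out = np.zeros((Q.shape[0], V.shape[1]))
    for leaf in np.unique(q_leaf):
        q_idx = np.flatnonzero(q_leaf == leaf)
        k_idx = np.flatnonzero(k_leaf == leaf)
        if k_idx.size == 0:
            continue  # assumption of this sketch: no keys means a zero output
        logits = Q[q_idx] @ K[k_idx].T / np.sqrt(Q.shape[1])
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)
        out[q_idx] = w @ V[k_idx]
    return out

# Hypothetical usage: 64 tokens, 16 dimensions, a depth-3 tree (8 leaves).
rng = np.random.default_rng(0)
n, d, depth = 64, 16, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((2 ** depth - 1, d))  # random, not learned
b = np.zeros(2 ** depth - 1)
Y = leaf_attention(X, X, X, W, b, depth)
```

With `depth = 0` the tree has a single leaf and the function reduces to ordinary dense softmax attention. Deeper trees shrink the key set each query is compared against, which is where the savings in attention FLOPs would come from.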
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Softmax · Multi-Head Attention · Adam · Linear Layer · Multi-Head Linear Attention
