TreeCoders: Trees of Transformers

Pierre Colonna D'Istria; Abdulrahman Altahhan

arXiv:2411.07218 · cs.CL · November 12, 2024

TL;DR

TreeCoders arranges transformer blocks as the nodes of a complete k-ary tree, with external selectors that route each token sequence to a single leaf. The design enables sparse activation and distributed implementation, and achieves competitive performance across language datasets.

Contribution

The paper presents a tree-based transformer design in which the selectors sit outside the transformer blocks and node activation is sparse, in contrast to traditional linear transformers.

Findings

01. Outperforms size-matched linear transformers 76% of the time.

02. Supports sparse node activation with logarithmic complexity.

03. Demonstrates competitive results on diverse language datasets.
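
The logarithmic-complexity finding can be made concrete with a little arithmetic. The sketch below assumes a complete k-ary tree whose root sits at depth 0 and whose leaves all sit at the same depth, and it assumes that exactly one node per level is activated for a given input. The function names are illustrative and do not come from the paper.

```python
def total_nodes(k: int, depth: int) -> int:
    """Count the nodes of a complete k-ary tree with leaves at `depth` (root = depth 0)."""
    return sum(k ** level for level in range(depth + 1))


def active_nodes(depth: int) -> int:
    """Count the nodes on one root-to-leaf path (assumption: one node per level)."""
    return depth + 1


# Compare how many nodes exist with how many a single input would touch.
for k, depth in [(2, 3), (3, 3), (2, 6)]:
    total = total_nodes(k, depth)
    active = active_nodes(depth)
    print(f"k={k} depth={depth}: {active} of {total} nodes active")
```

Under these assumptions the total node count grows exponentially with depth while the number of active nodes grows linearly, which is the same as growing logarithmically in the total node count.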

Abstract

In this paper, we introduce TreeCoders, a novel family of transformer trees. We move away from traditional linear transformers to complete k-ary trees. Transformer blocks serve as nodes, and generic classifiers learn to select the best child and route the sequence of tokens to a specific leaf. The selectors, moved outside the transformer blocks, allow for the use of a variety of architectures without further modifications. Furthermore, our proposed architecture supports sparse node activation due to the logarithmic complexity of a tree search. We validate our idea by testing a series of decoder-only tree transformers, achieving competitive results across a diverse range of language datasets. Our study demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76% of the time over a wide range of tree architectures. Furthermore, our…
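
The routing described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, `build_tree`, and `route` are hypothetical names, the "block" is a toy function standing in for a transformer block, and the "selector" is a toy rule standing in for a learned classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Tokens = List[float]  # stand-in for a sequence of token representations


@dataclass
class Node:
    name: str
    block: Callable[[Tokens], Tokens]                  # stand-in for a transformer block
    selector: Optional[Callable[[Tokens], int]] = None  # external classifier; None at leaves
    children: List["Node"] = field(default_factory=list)


def build_tree(k: int, depth: int, name: str = "r") -> Node:
    """Build a complete k-ary tree with toy blocks and toy selectors (both hypothetical)."""
    def block(tokens: Tokens) -> Tokens:
        return [t + 1.0 for t in tokens]  # toy transformation, not real attention

    if depth == 0:
        return Node(name, block)

    def selector(tokens: Tokens) -> int:
        return int(sum(tokens)) % k  # toy routing rule, not a learned classifier

    children = [build_tree(k, depth - 1, f"{name}{i}") for i in range(k)]
    return Node(name, block, selector, children)


def route(root: Node, tokens: Tokens) -> Tuple[Tokens, List[str]]:
    """Run the sequence through one root-to-leaf path and record the visited nodes."""
    node, visited = root, []
    while True:
        tokens = node.block(tokens)   # only the current node's block is evaluated
        visited.append(node.name)
        if not node.children:         # reached a leaf
            return tokens, visited
        node = node.children[node.selector(tokens)]  # selector picks exactly one child


root = build_tree(k=2, depth=3)
out, path = route(root, [0.0, 1.0, 2.0])
print(path, out)
```

Because the selector lives outside the block in this sketch, the `block` callable could be swapped for any other sequence-to-sequence function without touching the routing loop, which mirrors the flexibility the abstract attributes to external selectors.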

Taxonomy

Topics: Cellular Automata and Applications · Algorithms and Data Compression · Advanced Database Systems and Queries

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection