Fast and Simplex: 2-Simplicial Attention in Triton
Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil

TL;DR
This paper introduces the 2-simplicial Transformer with trilinear attention, which improves token efficiency over standard Transformers, especially for reasoning and knowledge tasks, by altering the scaling laws.
Contribution
It presents the 2-simplicial Transformer architecture with an efficient Triton kernel, demonstrating improved token efficiency and altered scaling laws for reasoning tasks.
Findings
The 2-simplicial Transformer outperforms standard Transformers on reasoning tasks.
Achieves better token efficiency at fixed model size and token budget.
Changes the exponent in the scaling laws for knowledge and reasoning tasks.
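For context, the exponent finding can be read against the parametric loss form commonly used in the scaling-law literature. This is an assumption for illustration: this summary does not state which form the paper fits. Here N is the parameter count, D is the number of training tokens, and E, A, B, \alpha, \beta are fitted constants.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Under this form, changing the exponent would mean a different fitted \alpha or \beta, not merely a different constant A or B. A larger exponent means the loss falls faster as the corresponding resource grows.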
Abstract
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget,…
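To make "trilinear attention" concrete, here is a minimal NumPy sketch of one way to generalize dot-product attention to a three-way product. It is not the paper's Triton kernel. The 1/sqrt(d) scaling, the joint softmax over key pairs, and the elementwise product of the two value vectors are assumptions based on a common formulation of 2-simplicial attention, and the function name and argument names are hypothetical.

```python
import numpy as np

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Naive single-head trilinear attention (illustrative sketch only).

    q: (n, d) queries; k1, v1: (n1, d); k2, v2: (n2, d).
    Returns the (n, d) output and the (n, n1, n2) attention weights.
    """
    n, d = q.shape
    # Trilinear logits: A[i, j, k] = sum_l q[i, l] * k1[j, l] * k2[k, l]
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / np.sqrt(d)  # assumed scaling
    # Softmax taken jointly over all (j, k) pairs for each query i
    flat = logits.reshape(n, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(flat)
    weights /= weights.sum(axis=1, keepdims=True)
    weights = weights.reshape(logits.shape)
    # Each (j, k) pair contributes the elementwise product v1[j] * v2[k]
    out = np.einsum("ijk,jd,kd->id", weights, v1, v2)
    return out, weights

rng = np.random.default_rng(0)
q, k1, k2, v1, v2 = (rng.standard_normal((4, 8)) for _ in range(5))
out, weights = two_simplicial_attention(q, k1, k2, v1, v2)
print(out.shape, weights.shape)
```

This naive version materializes a logits tensor with one entry per (query, key, key) triple, so its cost grows cubically with sequence length. That cost is the reason an efficient kernel implementation matters. When k2 and v2 are all ones, the sketch reduces to standard softmax attention over k1 and v1.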
Taxonomy
Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer
