Fast and Simplex: 2-Simplicial Attention in Triton

Aurko Roy; Timothy Chou; Sai Surya Duvvuri; Sijia Chen; Jiecao Yu; Xiaodong Wang; Manzil Zaheer; Rohan Anil

arXiv:2507.02754·cs.LG·July 4, 2025

TL;DR

This paper introduces the 2-simplicial Transformer, which generalizes dot-product attention to a trilinear form. For a fixed token budget it is more token-efficient than a standard Transformer, especially on reasoning and knowledge tasks, and it changes the exponent of the scaling law for those tasks.

Contribution

The paper presents the 2-simplicial Transformer architecture together with an efficient Triton kernel, and demonstrates improved token efficiency and altered scaling laws on reasoning tasks.

Findings

01

The 2-simplicial Transformer outperforms standard Transformers on reasoning tasks.

02

Achieves better token efficiency at fixed model size and token budget.

03

Changes the exponent in the scaling laws for knowledge and reasoning tasks.
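
To make "the exponent" concrete, here is a sketch of a commonly used parametric scaling law, with N the parameter count and D the number of training tokens. The specific form is an assumption: this summary does not state which parametrization the paper fits, and the symbols E, A, B, α, β are illustrative.

```latex
% Assumed parametric form (Chinchilla-style); not confirmed by this page.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% With the token budget D held fixed, the data term folds into the constant:
L(N) = E' + \frac{A}{N^{\alpha}}, \qquad E' = E + \frac{B}{D^{\beta}}

% The exponent is then the slope of a log-log fit:
\log\bigl(L(N) - E'\bigr) = \log A - \alpha \log N
```

Under this reading, a larger α means the loss falls faster as the model grows at the same token budget, which is one way an architecture can be more token-efficient.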

Abstract

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget,…
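
The step from dot-product attention to a trilinear function can be illustrated with a small reference implementation. This is a minimal pure-Python sketch, not the paper's Triton kernel. It assumes a second key and value projection (`k2`, `v2`), a `1/sqrt(d)` scaling, a softmax taken jointly over key pairs, and an elementwise product of the two values; all function and argument names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a flat list."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot_product_attention(q, k, v):
    """Standard attention: logit[i][j] = <q_i, k_j> / sqrt(d)."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        logits = [sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d)
                  for j in range(n)]
        w = softmax(logits)
        out.append([sum(w[j] * v[j][c] for j in range(n)) for c in range(d)])
    return out

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Trilinear attention (assumed form):
    logit[i][j][k] = sum_c q[i][c] * k1[j][c] * k2[k][c] / sqrt(d),
    softmax over all (j, k) pairs, output mixes v1[j] * v2[k] elementwise."""
    n, d = len(q), len(q[0])
    pairs = [(j, k) for j in range(n) for k in range(n)]
    out = []
    for i in range(n):
        logits = [sum(q[i][c] * k1[j][c] * k2[k][c] for c in range(d))
                  / math.sqrt(d) for j, k in pairs]
        w = softmax(logits)  # one distribution over n * n key pairs
        out.append([sum(wjk * v1[j][c] * v2[k][c]
                        for wjk, (j, k) in zip(w, pairs))
                    for c in range(d)])
    return out
```

Setting `k2` and `v2` to all ones makes the third factor drop out, so this sketch reduces to ordinary dot-product attention, which is a useful sanity check. The dense version costs O(n³) per sequence in both time and logits, which suggests why an efficient kernel matters for this architecture.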

Taxonomy

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer