Residual Matrix Transformers: Scaling the Size of the Residual Stream
Brian Mak, Jeffrey Flanigan

TL;DR
The Residual Matrix Transformer (RMT) replaces the transformer's residual stream with an outer product memory matrix, so the size of the residual stream can be scaled independently of compute and model size. It reaches the same loss with fewer resources and performs better on downstream tasks.
Contribution
This paper introduces the RMT, a transformer variant whose residual stream can be scaled independently of compute and model size, reducing the FLOPS, parameters, and training tokens needed to reach a given loss while improving downstream performance.
Findings
RMT achieves the same loss as the transformer with 58% fewer FLOPS
RMT achieves the same loss with 25% fewer parameters and 41% fewer training tokens
RMT outperforms the transformer on downstream evaluations
Abstract
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the…
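To make the term "outer product memory matrix" concrete, the following is a minimal sketch of a classical linear associative memory in plain Python. It is not the RMT's actual read and write mechanism, which the abstract does not spell out. The helper names (`store`, `retrieve`), the dimensions, and the choice of orthonormal keys are all assumptions made for illustration. Each value is written by adding the outer product of the value and its key to the memory, and read back by multiplying the memory by the key.

```python
def outer(u, v):
    """Outer product u v^T, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matvec(m, x):
    """Matrix-vector product m x."""
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in m]


def store(memory, key, value):
    """Write (hypothetical helper): memory + value key^T."""
    return add(memory, outer(value, key))


def retrieve(memory, key):
    """Read (hypothetical helper): memory key."""
    return matvec(memory, key)


# Illustrative assumption: keys are orthonormal (standard basis vectors),
# so stored values do not interfere with one another on retrieval.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[2.0, -1.0], [0.5, 3.0]]

# The memory has shape (value_dim x key_dim) = (2 x 3) and starts empty.
memory = [[0.0] * 3 for _ in range(2)]
for k, v in zip(keys, values):
    memory = store(memory, k, v)

# Reading with each key recovers the value stored under it.
recalled = [retrieve(memory, k) for k in keys]
```

With orthonormal keys, retrieval is exact because the cross terms between different keys vanish. With non-orthogonal keys, the recalled vector would mix in contributions from other stored values.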
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Ferroelectric and Negative Capacitance Devices
