Residual Matrix Transformers: Scaling the Size of the Residual Stream

Brian Mak; Jeffrey Flanigan

arXiv:2506.22696 · cs.LG · July 1, 2025

TL;DR

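The Residual Matrix Transformer (RMT) replaces the transformer's residual stream with an outer product memory matrix. This lets the size of the residual stream be scaled independently of compute and model size, so the RMT reaches the same loss as the transformer with fewer FLOPS, parameters, and training tokens, and it performs better on downstream tasks.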

Contribution

This paper introduces the RMT, a transformer variant whose residual stream can be scaled independently of compute and model size, reducing the FLOPS and parameters needed to reach a given loss while improving downstream performance.

Findings

01 RMT achieves the same loss as the transformer with 58% fewer FLOPS

02 RMT achieves the same loss with 25% fewer parameters

03 RMT outperforms the transformer on downstream evaluations

Abstract

The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the…
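
The outer product memory cited in the abstract (Kohonen, 1972; Anderson, 1972) can be illustrated with a minimal sketch. It shows only the classical linear associative memory: a key-value pair is written by adding the outer product of value and key to a matrix, and read back by multiplying the matrix with the key. How the RMT's layers actually read from and write to the residual matrix is not described in this excerpt, so the function names, dimensions, and data below are hypothetical and chosen purely for illustration.

```python
# Classical outer-product (linear associative) memory in pure Python.
# Illustrative assumption only: this is NOT the RMT's layer read/write mechanism.

def zeros(rows, cols):
    """Create an empty memory matrix of shape rows x cols."""
    return [[0.0] * cols for _ in range(rows)]

def outer(value, key):
    """Outer product value * key^T, returned as a list of rows."""
    return [[v * k for k in key] for v in value]

def store(memory, key, value):
    """Write: M <- M + value key^T (returns a new matrix)."""
    update = outer(value, key)
    return [[m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)]

def retrieve(memory, key):
    """Read: M key, the matrix-vector product."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]

# Hypothetical example data: two orthonormal keys, two 2-d values.
key_a, value_a = [1.0, 0.0, 0.0], [2.0, -1.0]
key_b, value_b = [0.0, 1.0, 0.0], [0.5, 4.0]

memory = zeros(2, 3)                      # value_dim x key_dim
memory = store(memory, key_a, value_a)
memory = store(memory, key_b, value_b)

recalled_a = retrieve(memory, key_a)      # recovers value_a
recalled_b = retrieve(memory, key_b)      # recovers value_b
```

Because the two keys are orthonormal, each stored value is recovered exactly; with non-orthogonal keys the retrieved vector would also contain contributions from the other stored values. Note that the memory's capacity depends on its dimensions rather than on any single vector width, which is the general property that makes such a matrix a candidate for a separately sized store.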

Videos

Residual Matrix Transformers: Scaling the Size of the Residual Stream · slideslive

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Ferroelectric and Negative Capacitance Devices