Compute Better Spent: Replacing Dense Layers with Structured Matrices
Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew, Gordon Wilson

TL;DR
This paper investigates replacing dense layers with structured matrices like Monarch and BTT to improve computational efficiency in foundation models, demonstrating significant performance gains and reduced compute requirements.
Contribution
It introduces the BTT structured matrix family, analyzes initialization and learning rate scaling, and shows BTT's superior efficiency over dense layers in various tasks.
Findings
BTT outperforms dense matrices on multiple tasks.
BTT achieves lower training loss with less compute.
BTT matches dense ViT performance on ImageNet-1k with 3.8x less compute.
Abstract
Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDNA and Biological Computing
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Adam · Attention Dropout · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer
