Compute Better Spent: Replacing Dense Layers with Structured Matrices

Shikai Qiu; Andres Potapczynski; Marc Finzi; Micah Goldblum; Andrew Gordon Wilson

arXiv:2406.06248 · cs.LG · June 11, 2024

TL;DR

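This paper investigates replacing dense layers with structured matrices such as Monarch and Block Tensor-Train (BTT) to make foundation models more compute-efficient. With initialization scales and learning rates chosen per structure, BTT reaches lower training loss than dense layers for less compute and matches dense ViT performance on ImageNet-1k with 3.8x less compute.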

Contribution

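The paper introduces the BTT structured matrix family, which contains Monarch matrices; uses the Maximal Update Parameterization to determine how initialization and learning rates should scale for structured layers; and shows that BTT is more compute-efficient than dense layers on multiple tasks.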

Findings

01. BTT outperforms dense matrices on multiple tasks.

02. BTT achieves lower training loss with less compute.

03. BTT matches dense ViT performance on ImageNet-1k with 3.8x less compute.

Abstract

Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch…
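To make the idea of a structured replacement for a dense matrix concrete, here is a minimal pure-Python sketch of a Monarch-style matrix-vector product: two block-diagonal factors interleaved with a transpose permutation. It is an illustration under stated assumptions, not the authors' implementation. The function names (`block_diag_matvec`, `monarch_matvec`, `init_blocks`, `param_counts`) are hypothetical, the layout assumes a square layer of width d = b² with b blocks of size b × b per factor, and the 1/sqrt(block fan-in) initialization is a generic fan-in rule used here only to show that the appropriate scale depends on the structure. It is not the scaling derived in the paper.

```python
import math
import random


def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (list of square blocks) by vector x."""
    b = len(blocks[0])
    out = []
    for k, block in enumerate(blocks):
        chunk = x[k * b:(k + 1) * b]  # slice of x seen by block k
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in block)
    return out


def transpose_perm(x, b):
    """View x as a row-major b-by-b grid and return its transpose, flattened."""
    return [x[j * b + i] for i in range(b) for j in range(b)]


def monarch_matvec(left, right, x):
    """Monarch-style product: block-diag, permute, block-diag, permute back."""
    b = len(right)
    h = block_diag_matvec(right, x)
    h = transpose_perm(h, b)
    h = block_diag_matvec(left, h)
    return transpose_perm(h, b)


def init_blocks(num_blocks, block_size, rng):
    """Assumed fan-in rule: std = 1/sqrt(block_size), since each output only
    sums over block_size inputs rather than the full width d."""
    std = 1.0 / math.sqrt(block_size)
    return [
        [[rng.gauss(0.0, std) for _ in range(block_size)]
         for _ in range(block_size)]
        for _ in range(num_blocks)
    ]


def param_counts(d):
    """Parameter counts for a d-by-d layer, assuming d is a perfect square."""
    b = math.isqrt(d)
    assert b * b == d, "this sketch assumes d = b * b"
    return {"dense": d * d, "monarch": 2 * b ** 3}


rng = random.Random(0)
d = 16
b = math.isqrt(d)
left = init_blocks(b, b, rng)
right = init_blocks(b, b, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = monarch_matvec(left, right, x)
counts = param_counts(d)
```

Under these assumptions the structured layer uses 2·d^1.5 parameters, and a matching number of multiply-adds, instead of d² for a dense layer. Each block has fan-in b rather than d, so an initialization scale tuned for a dense layer of width d does not carry over directly.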

Code & Models

Repositories

shikaiqiu/compute-better-spent (PyTorch, official)

Taxonomy

Topics: DNA and Biological Computing

Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Adam · Attention Dropout · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer