Modular Duality in Deep Learning

Jeremy Bernstein; Laker Newhouse

arXiv:2410.21265·cs.LG·December 9, 2024

TL;DR

This paper introduces modular dualization, a duality map that sends the gradient (a dual vector) to the primal space where the weights reside before it is subtracted from them. The map serves as a unifying theoretical basis for fast and scalable training algorithms.

Contribution

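The paper constructs a duality map for general neural networks by assigning an operator norm to each layer according to that layer's semantics, then using these layerwise norms to recursively induce a duality map on the weight space of the full architecture.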

Findings

01. Derived GPU-friendly dualization algorithms for Embed, Linear and Conv2D layers

02. Used the methods to set speed records for NanoGPT training

03. Proposed a theoretical foundation for next-generation optimizers

Abstract

An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We conclude by deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers -- the latter two methods are based on a rectangular Newton-Schulz iteration (Kovarik, 1970; Björck & Bowie, 1971). A variant of our…
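
To make the Newton-Schulz step in the abstract concrete, here is a minimal sketch of a rectangular Newton-Schulz iteration in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the cubic coefficients (3/2, -1/2), the Frobenius-norm pre-scaling, the step count, and the helper names `matmul`, `transpose` and `newton_schulz` are all choices made here, and any layer-specific scale factor is omitted. For a gradient matrix G with reduced SVD G = U S V^T, the iteration drives every nonzero singular value toward 1, so the result approximates U V^T, which is the matrix T of spectral norm at most 1 that maximizes the inner product <G, T>.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=30):
    """Approximate U V^T for g = U S V^T using the classical cubic iteration.

    Assumption: coefficients (1.5, -0.5) and Frobenius pre-scaling are the
    textbook choices, not necessarily those used in the paper or its repo.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [row[:] for row in g]  # zero gradient: nothing to orthogonalize

    # Scaling by the Frobenius norm puts all singular values in (0, 1],
    # inside the region where the cubic iteration converges.
    x = [[v / fro for v in row] for row in g]

    # Work in the wide orientation so the Gram matrix X X^T is the smaller one.
    tall = len(x) > len(x[0])
    if tall:
        x = transpose(x)

    for _ in range(steps):
        gram = matmul(x, transpose(x))  # X X^T
        gx = matmul(gram, x)            # X X^T X
        # Each singular value s is mapped to 1.5 * s - 0.5 * s**3.
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)] for rp, rq in zip(x, gx)]

    return transpose(x) if tall else x


if __name__ == "__main__":
    grad = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0]]  # hypothetical 2 x 3 gradient
    dual = newton_schulz(grad)
    print(matmul(dual, transpose(dual)))  # close to the 2 x 2 identity
```

The iteration uses only matrix multiplications, which is what makes this family of methods attractive on GPUs. Very small singular values need more steps to reach 1, and exactly zero singular values stay at zero, so a rank-deficient gradient yields a partial isometry rather than a matrix with orthonormal rows.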

Code & Models

Repositories

jxbz/modula (PyTorch)

Taxonomy

Topics: Neural Networks and Applications

Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings