TL;DR
BASIS is a backpropagation method that reduces activation memory by computing weight updates from compressed rank-R sketches while still propagating the exact error signal, so that the memory cost of training deep networks no longer grows with the batch and sequence dimensions.
Contribution
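The paper proposes BASIS (Balanced Activation Sketching with Invariant Scalars), an algorithm that decouples activation memory from the batch and sequence dimensions by computing weight updates from rank-R activation sketches, together with mechanisms that manage the bias-variance tradeoff of the sketched gradients.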
Findings
BASIS matches exact backpropagation in validation loss at sketch rank R=32.
Training remains stable even under extreme spatial compression at R=1.
Code implementation is publicly available at the provided GitHub URL.
Abstract
The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L · BN) spatial bottleneck (where L is the network depth, B is the sequence-batch cardinality, and N is the feature dimension). This constraint has historically throttled the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this, it suffers from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two…
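The split between an exact dX and a sketched dW can be illustrated for a single linear layer y = x W. The sketch below is not the BASIS algorithm: the abstract is truncated before it describes the paper's stabilizing mechanisms, so this uses a plain Gaussian random projection as a stand-in. The function names (`sketch_matrix`, `linear_forward`, `linear_backward`), the Gaussian sketch, and the seed-based regeneration are all assumptions made for illustration.

```python
import numpy as np


def sketch_matrix(rank, rows, seed):
    """Hypothetical Gaussian sketch S of shape (rank, rows), scaled so E[S.T @ S] = I."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((rank, rows)) / np.sqrt(rank)


def linear_forward(x, w, rank, seed):
    """Forward pass of y = x @ w that keeps only a rank-R sketch of x.

    x has shape (B, N): B is the flattened sequence-batch axis, N the feature dimension.
    """
    s = sketch_matrix(rank, x.shape[0], seed)
    x_sketch = s @ x  # shape (R, N): the only activation saved for the backward pass
    return x @ w, (x_sketch, seed)


def linear_backward(dy, w, saved):
    """Backward pass: exact dX, sketched dW."""
    x_sketch, seed = saved
    # Regenerate the same S from the seed instead of storing it.
    s = sketch_matrix(x_sketch.shape[0], dy.shape[0], seed)
    dx = dy @ w.T               # exact error signal; does not need the stored x
    dw = x_sketch.T @ (s @ dy)  # unbiased but noisy estimate of x.T @ dy
    return dx, dw


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 16))   # B = 64, N = 16
    w = rng.standard_normal((16, 8))
    dy = rng.standard_normal((64, 8))
    y, saved = linear_forward(x, w, rank=4, seed=123)
    dx, dw = linear_backward(dy, w, saved)
    print("saved activation shape:", saved[0].shape, "instead of", x.shape)
    print("dW shape:", dw.shape)
```

In this stand-in, the saved activation per layer shrinks from B × N to R × N, which is what removes B from the memory cost. A single draw of this plain sketch gives a very noisy dW at small R, which is the variance problem the abstract attributes to randomized automatic differentiation; the paper's remedies for it are not reproduced here.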
