The Affine Divergence: Aligning Activation Updates Beyond Normalisation
George Bird

TL;DR
This paper introduces the Affine Divergence, a perspective on how activation updates relate to normalisation. It proposes alternative normalisation methods that are reported to outperform traditional ones and offers a mechanistic account of how normalisation acts.
Contribution
It presents a theoretical framework for activation updates and normalisation, introduces the PatchNorm method, and challenges conventional affine normalisation approaches.
Findings
PatchNorm is reported to outperform traditional normalisers in the paper's experiments.
Theoretical analysis links normalisation to activation function maps.
Alternative normalisation methods are empirically validated.
Abstract
A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh, conceptual reframe of normalisation's action, with auxiliary experiments bolstering this mechanistic…
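The sample-wise scaling mentioned in the abstract can be illustrated for the simplest case, a single affine layer z = Wx + b updated on one sample. The sketch below is not taken from the paper: the function name `activation_update`, the use of NumPy, and the choice of plain gradient descent with learning rate `lr` are all assumptions made for illustration. It assumes the ideal activation step is proportional to -g, where g = dL/dz, and checks that the step actually propagated to z after updating W and b is instead -lr * g * (||x||^2 + 1), so it grows with the input norm.

```python
import numpy as np

def activation_update(W, b, x, g, lr):
    """Change in z = W @ x + b for sample x after one gradient step on W and b.

    g is dL/dz for this sample (assumed given); lr is the learning rate.
    Hypothetical helper for illustration only.
    """
    dW = -lr * np.outer(g, x)  # parameter step, since dL/dW = g x^T
    db = -lr * g               # parameter step, since dL/db = g
    z_old = W @ x + b
    z_new = (W + dW) @ x + (b + db)
    return z_new - z_old

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
g = rng.normal(size=3)
lr = 0.1

x_small = rng.normal(size=4)
x_large = 10.0 * x_small  # same direction, larger norm

for x in (x_small, x_large):
    dz = activation_update(W, b, x, g, lr)
    scale = x @ x + 1.0  # ||x||^2 + 1
    # The propagated activation step is the -lr * g step times a
    # sample-dependent factor.
    assert np.allclose(dz, -lr * g * scale)
    print("input norm:", np.linalg.norm(x), "scale factor:", scale)
```

Under these assumptions, two inputs that produce the same gradient g receive activation steps of different sizes purely because their norms differ. Dividing the step by the factor ||x||^2 + 1 would remove that dependence in this toy setting; whether this matches the paper's proposed correction is not stated in the text above.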
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Domain Adaptation and Few-Shot Learning
