TL;DR
This paper investigates a negative weight drift that arises when standard losses are combined with positively biased activation functions, shows how the drift drives activation sparsity, and proposes clipping techniques to improve training stability and performance.
Contribution
It provides a theoretical analysis of the negative weight drift induced by positively biased activation functions, characterizes its effects across architectures, and introduces clipping methods to mitigate activation spikes.
Findings
Weight drift drives weights toward negative values during early training.
Coupled with ReLU, weight drift produces activation sparsity of up to 90% in GPT-nano.
Clipping ReLU$^2$ mitigates activation spikes and improves validation loss.
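As a rough illustration of what clipping ReLU$^2$ could look like, the sketch below caps the squared-ReLU output at a fixed ceiling. The summary does not state the paper's exact clipping rule, so the function name `clipped_relu_squared` and the default ceiling `clip=6.0` are hypothetical choices made only for this example.

```python
def relu_squared(x: float) -> float:
    """ReLU^2: zero for negative inputs, x**2 for positive inputs."""
    r = max(0.0, x)
    return r * r


def clipped_relu_squared(x: float, clip: float = 6.0) -> float:
    """ReLU^2 with its output capped at `clip`.

    Assumption: a simple output ceiling. The paper's actual clipping
    method may differ; `clip=6.0` is a hypothetical value.
    """
    return min(relu_squared(x), clip)


if __name__ == "__main__":
    for x in (-1.0, 2.0, 10.0):
        print(x, relu_squared(x), clipped_relu_squared(x))
```

Because ReLU$^2$ grows quadratically, a single large pre-activation produces a much larger output than it would under plain ReLU. A ceiling of this kind bounds that output, which is one plausible way to limit activation spikes.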
Abstract
The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations…
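The sign claim for the MSE case can be checked numerically in a simplified setting. The sketch below is not the paper's proof. It assumes a single linear readout `W2` with i.i.d. zero-mean Gaussian entries of standard deviation `sigma`, fixed non-negative hidden activations `h` (the output of a ReLU), and fixed targets `y` that are independent of the weights. Under those assumptions, the gradient of 0.5 * ||W2 h - y||^2 with respect to `h` is W2^T (W2 h - y), and its expectation over the initialization is n_out * sigma^2 * h, which is non-negative in every coordinate. All names and sizes here are illustrative.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def grad_wrt_hidden(W2, h, y):
    """Gradient of 0.5 * ||W2 h - y||^2 with respect to h: W2^T (W2 h - y)."""
    residual = [sum(w * hj for w, hj in zip(row, h)) - yk
                for row, yk in zip(W2, y)]
    return [sum(W2[k][j] * residual[k] for k in range(len(W2)))
            for j in range(len(h))]


def mean_hidden_grad(n_hidden=8, n_out=4, sigma=0.5, trials=20000, seed=0):
    """Monte Carlo average of the gradient over random draws of W2."""
    rng = random.Random(seed)
    # Fixed non-negative activations and fixed targets (illustrative values).
    h = relu([rng.gauss(0.0, 1.0) for _ in range(n_hidden)])
    y = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    total = [0.0] * n_hidden
    for _ in range(trials):
        # Fresh zero-mean initialization of the readout weights.
        W2 = [[rng.gauss(0.0, sigma) for _ in range(n_hidden)]
              for _ in range(n_out)]
        g = grad_wrt_hidden(W2, h, y)
        total = [t + gj for t, gj in zip(total, g)]
    mean = [t / trials for t in total]
    # Closed form under the stated assumptions: n_out * sigma^2 * h.
    expected = [n_out * sigma ** 2 * hj for hj in h]
    return h, mean, expected


if __name__ == "__main__":
    h, mean, expected = mean_hidden_grad()
    for hj, m, e in zip(h, mean, expected):
        print(f"h={hj:.3f}  mean grad={m:.3f}  closed form={e:.3f}")
```

The target term W2^T y averages to zero because the weights are zero-mean and independent of `y`, so only the W2^T W2 h term survives. That is why the expected sign in this toy setting depends on the activations being non-negative rather than on the data.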
