Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

Egor Shvetsov; Aleksandr Serkov; Shokorov Viacheslav; Redko Dmitry; Vladislav Goloshchapov; Evgeny Burnaev

arXiv:2605.17659·cs.LG·May 22, 2026

TL;DR

This paper investigates an intrinsic negative weight drift caused by positively biased activation functions in neural networks, shows how it produces activation sparsity, and proposes clipping techniques to improve training stability and performance.

Contribution

It provides a theoretical analysis of the negative weight drift induced by positively biased activation functions, characterizes its effects across architectures, and introduces clipping methods to mitigate activation spikes.

Findings

01. Weight drift drives weights toward negative values during early training.

02. Coupled with ReLU, weight drift produces activation sparsity of up to 90% in GPT-nano.

03. Clipping ReLU$^2$ mitigates activation spikes and improves validation loss.
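
To make the quantities in the findings concrete, here is a minimal NumPy sketch of a squared ReLU, a clipped variant, and an activation-sparsity measure. It rests on assumptions: this summary does not state the paper's exact clipping rule, so `clipped_relu2` simply caps the output at a hypothetical threshold `clip`, and `activation_sparsity` uses one common definition (the fraction of activations that are exactly zero). None of these function names come from the paper or its repository.

```python
import numpy as np


def relu2(x):
    """Squared ReLU: max(x, 0) ** 2, applied elementwise."""
    return np.maximum(x, 0.0) ** 2


def clipped_relu2(x, clip=6.0):
    """Squared ReLU with its output capped at `clip`.

    `clip` is a hypothetical threshold chosen for illustration only;
    the paper's actual clipping scheme may differ.
    """
    return np.minimum(relu2(x), clip)


def activation_sparsity(a):
    """Fraction of activations that are exactly zero."""
    return float(np.mean(a == 0.0))


# Toy pre-activations: three non-positive values, one moderate, one large.
x = np.array([-2.0, -0.5, 0.0, 1.0, 10.0])

unclipped = relu2(x)        # the large input becomes 100.0 (a "spike")
clipped = clipped_relu2(x)  # the same input is capped at 6.0
sparsity = activation_sparsity(unclipped)  # 3 of 5 entries are zero
```

In this toy input, clipping leaves the zero entries untouched, so the sparsity is the same before and after clipping; only the magnitude of the largest activation changes.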

Abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations…

Code & Models

Repositories

On-Point-RND/BugOrFeature (GitHub)
