Numerical influence of ReLU'(0) on backpropagation
David Bertoin (ISAE-SUPAERO), Jérôme Bolte (TSE-R), Sébastien Gerchinovitz (IMT), Edouard Pauwels (IRIT-ADRIA)

TL;DR
This study investigates how the value chosen for ReLU'(0), the derivative of ReLU at zero, affects backpropagation and training at 16-, 32-, and 64-bit precision. The effect is systematic at 16 bits, appears about half of the time at 32 bits, and disappears at 64 bits, so ReLU'(0) behaves like a tunable hyperparameter.
Contribution
It measures the influence of ReLU'(0) on backpropagation and training at 16-, 32-, and 64-bit precision and shows that this value acts as a hyperparameter that can be tuned.
Findings
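ReLU'(0) affects backpropagation outputs systematically at 16-bit precision and about half of the time at 32-bit precision; the effect disappears at 64-bit precision.
For vanilla SGD training, ReLU'(0) = 0 appears to be the most efficient choice; on ImageNet it gained more than 10 points of test accuracy over ReLU'(0) = 1 (two runs).
Batch normalization and ADAM buffer the influence of ReLU'(0).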
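As a hedged illustration of why precision could matter here: ReLU'(0) is only consulted when a pre-activation is exactly zero, and rounding makes exact zeros more likely at lower precision. The sketch below is not the paper's experiment. The helper names `round_to` and `preactivation` are hypothetical, and the two inputs are picked by hand so that they collapse to the same number in 16 bits but stay distinct in 32 and 64 bits.

```python
import struct

# struct format codes for IEEE 754 half, single, and double precision.
HALF, SINGLE, DOUBLE = "e", "f", "d"


def round_to(value, fmt):
    """Round a Python float (64-bit) to the precision of the given struct format."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def preactivation(x, y, fmt):
    """Toy pre-activation z = x - y, with inputs and result rounded to `fmt`."""
    xr = round_to(x, fmt)
    yr = round_to(y, fmt)
    return round_to(xr - yr, fmt)


if __name__ == "__main__":
    # Hand-picked inputs (an assumption of this sketch): both round to 1.0
    # in half precision, so their difference is exactly 0.0 there.
    x, y = 1.0001, 1.0002
    for name, fmt in (("16-bit", HALF), ("32-bit", SINGLE), ("64-bit", DOUBLE)):
        z = preactivation(x, y, fmt)
        print(name, "z =", z, "| exactly zero:", z == 0.0)
```

In this toy case only the 16-bit result lands exactly on zero, which is the only situation in which the chosen value of ReLU'(0) would enter the gradient.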
Abstract
In theory, the choice of ReLU'(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. For our experiments on ImageNet the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also evidence that…
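To make the quantity under study concrete, here is a minimal sketch of a ReLU whose derivative at zero is a free parameter. It is an illustration under stated assumptions, not the authors' implementation: the names `relu_grad` and `loss_and_grad`, the parameter `s`, and the one-weight network are all hypothetical choices made for this example.

```python
def relu(z):
    """Standard ReLU: max(z, 0)."""
    return z if z > 0.0 else 0.0


def relu_grad(z, s):
    """Derivative of ReLU, with the value at z == 0 set to s (any s in [0, 1])."""
    if z > 0.0:
        return 1.0
    if z < 0.0:
        return 0.0
    return s  # the convention for ReLU'(0) only matters on this branch


def loss_and_grad(w, x, s):
    """Toy one-weight network: loss = relu(w * x), gradient dloss/dw by the chain rule."""
    z = w * x
    return relu(z), relu_grad(z, s) * x


if __name__ == "__main__":
    # Pre-activation is exactly zero: the loss is the same, the gradient is not.
    print(loss_and_grad(0.0, 2.0, s=0.0))  # (0.0, 0.0)
    print(loss_and_grad(0.0, 2.0, s=1.0))  # (0.0, 2.0)
    # Pre-activation is nonzero: the choice of s has no effect.
    print(loss_and_grad(0.5, 2.0, s=0.0))  # (1.0, 2.0)
    print(loss_and_grad(0.5, 2.0, s=1.0))  # (1.0, 2.0)
```

The forward value is identical for every `s`; only the backpropagated gradient changes, and only when a pre-activation is exactly zero.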
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques
Methods: Dropout · Max Pooling · Convolution · Stochastic Gradient Descent · Softmax · Dense Connections · Adam
