The Resurrection of the ReLU

Coşku Can Horuz; Geoffrey Kasenbacher; Saya Higuchi; Sebastian Kairat; Jendrik Stoltz; Moritz Pesl; Bernhard A. Moser; Christoph Linse; Thomas Martinetz; Sebastian Otte

arXiv:2505.22074·cs.LG·May 29, 2025

TL;DR

This paper introduces SUGAR, a surrogate gradient method for ReLU that enhances its performance and sparsity, effectively reviving the classical activation function across various deep learning architectures.

Contribution

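The paper proposes SUGAR, a surrogate gradient technique that keeps the standard ReLU in the forward pass and substitutes a smooth surrogate derivative in the backward pass, with the aim of improving generalization and sparsity in modern deep learning models.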

Findings

1. SUGAR improves generalization in CNN architectures like VGG-16 and ResNet-18.
2. SUGAR enhances sparsity and resurrects dead ReLUs effectively.
3. Replacing GELU with SUGAR yields competitive or better results in advanced models.

Abstract

Modeling sophisticated activation functions within deep learning architectures has evolved into a distinct research direction. Functions such as GELU, SELU, and SiLU offer smooth gradients and improved convergence properties, making them popular choices in state-of-the-art models. Despite this trend, the classical ReLU remains appealing due to its simplicity, inherent sparsity, and other advantageous topological characteristics. However, ReLU units are prone to becoming irreversibly inactive, a phenomenon known as the dying ReLU problem, which limits their overall effectiveness. In this work, we introduce surrogate gradient learning for ReLU (SUGAR) as a novel, plug-and-play regularizer for deep architectures. SUGAR preserves the standard ReLU function during the forward pass but replaces its derivative in the backward pass with a smooth surrogate that avoids zeroing out gradients. We…
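
The forward/backward split described in the abstract can be illustrated with a minimal scalar sketch. It is not the authors' implementation: the surrogate used here is the derivative of SiLU, chosen purely for illustration because SiLU is one of the smooth functions the abstract names, and the helper names (`relu`, `relu_grad`, `silu_grad`, `backward`) are hypothetical. The surrogate actually used in the paper may differ.

```python
import math


def relu(x: float) -> float:
    """Forward pass: the standard ReLU, left unchanged."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Exact ReLU derivative: zero for every non-positive pre-activation."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu_grad(x: float) -> float:
    """Derivative of SiLU(x) = x * sigmoid(x).

    ASSUMPTION: used here only as an example of a smooth surrogate;
    it is not confirmed to be the surrogate from the paper.
    """
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


def backward(x: float, upstream: float, surrogate: bool) -> float:
    """Chain rule through the activation, with a swappable local derivative."""
    local = silu_grad(x) if surrogate else relu_grad(x)
    return upstream * local


x = -1.0                                      # pre-activation of an inactive unit
out = relu(x)                                 # forward output is still exactly zero
g_plain = backward(x, 1.0, surrogate=False)   # exact derivative: no gradient flows
g_sugar = backward(x, 1.0, surrogate=True)    # surrogate: small non-zero gradient
```

In this sketch the unit's output stays at zero, so the forward pass keeps ReLU's sparsity, while the surrogate derivative still passes a small gradient back through the inactive unit, which is the mechanism by which a dead unit could recover. In a deep learning framework the same idea would normally be written as a custom autograd function rather than by hand.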

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques