# Optimized Weight Initialization on the Stiefel Manifold for Deep ReLU Neural Networks

**Authors:** Hyungu Lee, Taehyeong Kim, Hayoung Choi

arXiv: 2509.00362 · 2025-09-03

## TL;DR

This paper proposes an orthogonal weight initialization method on the Stiefel manifold tailored for deep ReLU networks, improving stability and performance by better controlling activation and gradient flow.

## Contribution

It formulates orthogonal initialization for ReLU networks as an optimization problem on the Stiefel manifold, derives a family of closed-form solutions with an efficient sampling scheme, and supports the method with theoretical analysis at initialization.

## Key findings

- Outperforms previous initializations across MNIST, Fashion-MNIST, multiple tabular datasets, few-shot settings, and ReLU-family activations
- Enables stable training of very deep ReLU networks
- Analysis at initialization shows prevention of dying ReLU, slower decay of activation variance, and mitigation of gradient vanishing
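
To make the dying-ReLU finding concrete, here is a minimal diagnostic sketch in NumPy. It is not the paper's evaluation protocol: the helper name `dead_fraction_per_layer`, the width, depth, and batch size, and the use of He initialization as the baseline are all assumptions chosen for illustration.

```python
import numpy as np


def dead_fraction_per_layer(weights, x):
    """Per layer, return the fraction of units that output 0 for every input row in x.

    Hypothetical helper for illustration; biases are omitted for simplicity.
    """
    fractions = []
    h = x
    for w in weights:
        h = np.maximum(h @ w, 0.0)  # ReLU layer without bias
        dead = np.all(h == 0.0, axis=0)  # unit is "dead" if inactive on the whole batch
        fractions.append(float(np.mean(dead)))
    return fractions


# Assumed toy setup: a 20-layer, width-64 MLP with He-initialized weights.
rng = np.random.default_rng(0)
width, depth, batch = 64, 20, 256
he_weights = [
    rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
    for _ in range(depth)
]
x = rng.standard_normal((batch, width))
fractions = dead_fraction_per_layer(he_weights, x)
```

Running the same function on weights produced by a different initializer gives a per-layer comparison of how many units are inactive at initialization.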

## Abstract

Stable and efficient training of ReLU networks with large depth is highly sensitive to weight initialization. Improper initialization can cause permanent neuron inactivation (the "dying ReLU" problem) and exacerbate gradient instability as network depth increases. Methods such as He, Xavier, and orthogonal initialization preserve variance or promote approximate isometry. However, they do not necessarily regulate the pre-activation mean or control activation sparsity, and their effectiveness often diminishes in very deep architectures. This work introduces an orthogonal initialization specifically optimized for ReLU by solving an optimization problem on the Stiefel manifold, thereby preserving scale and calibrating the pre-activation statistics from the outset. A family of closed-form solutions and an efficient sampling scheme are derived. Theoretical analysis at initialization shows that the method prevents the dying ReLU problem, slows the decay of activation variance, and mitigates gradient vanishing, which together stabilize signal and gradient flow in deep architectures. Empirically, across MNIST, Fashion-MNIST, multiple tabular datasets, few-shot settings, and ReLU-family activations, our method outperforms previous initializations and enables stable training in deep networks.
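
For readers unfamiliar with the setting, a point on the Stiefel manifold is a matrix with orthonormal columns. The sketch below draws one with the standard QR-based construction and checks the two properties the abstract relies on: orthonormality and scale preservation. This is the generic orthogonal baseline, not the paper's optimized closed-form solutions or its sampling scheme, which this summary does not reproduce; the function name `sample_stiefel` and the dimensions are assumptions.

```python
import numpy as np


def sample_stiefel(n, p, rng):
    """Draw an n x p matrix with orthonormal columns (requires n >= p).

    Hypothetical helper: plain QR-based orthogonal sampling, NOT the
    ReLU-optimized initialization proposed in the paper.
    """
    if n < p:
        raise ValueError("need n >= p for orthonormal columns")
    a = rng.standard_normal((n, p))
    q, r = np.linalg.qr(a)  # reduced QR: q is n x p, r is p x p
    d = np.sign(np.diag(r))
    d[d == 0] = 1.0
    return q * d  # flip column signs to remove the sign ambiguity of QR


rng = np.random.default_rng(0)
w = sample_stiefel(64, 32, rng)

# Orthonormal columns: W^T W should equal the identity.
gram_error = np.abs(w.T @ w - np.eye(32)).max()

# Scale preservation: ||W x|| equals ||x|| for any x in R^32.
x = rng.standard_normal(32)
norm_ratio = np.linalg.norm(w @ x) / np.linalg.norm(x)
```

Orthonormality fixes the scale of the linear map but says nothing about the sign pattern of the pre-activations, which is the gap the abstract says the proposed method targets.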

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00362/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00362/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/2509.00362/full.md

---
Source: https://tomesphere.com/paper/2509.00362