Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

Michael Scholkemper; Xinyi Wu; Ali Jadbabaie; Michael T. Schaub

arXiv:2406.02997·cs.LG·January 21, 2026·1 cites

Open Access 3 Reviews

TL;DR

This paper provides a theoretical analysis of how residual connections and normalization layers prevent oversmoothing in GNNs, introducing GraphNormv2 to improve message-passing without signal distortion.

Contribution

It offers a formal characterization of residual and normalization layers in GNNs, and proposes GraphNormv2 to enhance signal preservation during normalization.

Findings

01

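Residual connections re-inject the initial features at each layer, which prevents the signal from becoming too smooth and determines the subspace of possible node representations.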

02

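Batch normalization rescales each column of the feature matrix individually, which prevents the output embedding space from collapsing completely to a one-dimensional subspace.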

03

GraphNormv2 learns centering to avoid signal distortion.
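The first finding can be illustrated numerically. The sketch below is a toy experiment, not the paper's code: it assumes a row-stochastic operator P = D^{-1}(A + I), Gaussian random weights, an initial-residual update of the form X <- (1 - alpha) P X W + alpha X0, and a hypothetical 6-node graph. The function names and the smoothness measure are illustrative choices.

```python
import numpy as np

def random_walk_matrix(A):
    """Row-stochastic operator D^{-1}(A + I); self-loops keep it aperiodic."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def non_smooth_fraction(X):
    """Share of ||X||_F that is not constant across nodes (0 means fully smooth)."""
    centered = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(centered) / np.linalg.norm(X)

def run_linear_gnn(P, X0, weights, alpha=0.0):
    """Linearized layers X <- (1 - alpha) * P X W + alpha * X0 (alpha = 0: no residual)."""
    X = X0
    for W in weights:
        X = (1.0 - alpha) * (P @ X @ W) + alpha * X0
    return X

rng = np.random.default_rng(0)
n, d, depth = 6, 4, 64

# Hypothetical toy graph: a 6-node ring with one chord between nodes 0 and 3.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
A[0, 3] = A[3, 0] = 1.0

P = random_walk_matrix(A)
X0 = rng.standard_normal((n, d))
# Assumed random weights, scaled so that feature norms stay in a moderate range.
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

plain = non_smooth_fraction(run_linear_gnn(P, X0, weights, alpha=0.0))
residual = non_smooth_fraction(run_linear_gnn(P, X0, weights, alpha=0.2))
print(f"no residual: {plain:.2e}, initial residual: {residual:.2e}")
```

In this setup the non-constant share of the features should fall to numerical noise without the residual term and remain a visible fraction with alpha = 0.2; the exact values depend on the seed and the graph.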

Abstract

Residual connections and normalization layers have become standard design choices for graph neural networks (GNNs), and were proposed as solutions to mitigate the oversmoothing problem in GNNs. However, how exactly these methods help alleviate the oversmoothing problem from a theoretical perspective is not well understood. In this work, we provide a formal and precise characterization of (linearized) GNNs with residual connections and normalization layers. We establish that (a) for residual connections, the incorporation of the initial features at each layer can prevent the signal from becoming too smooth, and determines the subspace of possible node representations; (b) batch normalization prevents a complete collapse of the output embedding space to a one-dimensional subspace through the individual rescaling of each column of the feature matrix. This results in the convergence of…
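The "individual rescaling of each column" in point (b) can be made concrete with a small sketch. It assumes batch normalization with statistics taken over the nodes and without the learnable scale and shift; the function name, the eps value, and the example matrix are hypothetical. It shows only the per-column operation, not the convergence result.

```python
import numpy as np

def batch_norm_columns(X, eps=1e-5):
    """Center each feature column over the nodes, then rescale it to unit variance."""
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 10 nodes, 3 feature columns on different scales.
X = rng.standard_normal((10, 3)) * np.array([0.1, 1.0, 10.0])
X_bn = batch_norm_columns(X)

print("column std before:", X.std(axis=0))
print("column std after: ", X_bn.std(axis=0))
print("rank after:", np.linalg.matrix_rank(X_bn))
```

Because every column gets its own scale factor, a column with small magnitude is brought back to the same scale as the others instead of being dominated by them.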

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

1. It provides a thorough theoretical analysis of oversmoothing for residual connections and normalization layers. 2. The paper introduces a novel normalization layer that addresses a specific limitation of existing methods. The method is simple but effective.

Weaknesses

1. The analysis is conducted only on linearized GNNs (except Remark 4.6) with random weights. a. Most GNNs are nonlinear, so the properties of GNNs under nonlinearity are of greater interest. Can the analysis be generalized to nonlinear cases? b. Random weights usually appear only at initialization. A network may have bad properties when it is initialized and good properties once it is trained, so the more relevant case is trained GNNs, whose weights are not random. 2. The re

Reviewer 02·Rating 6·Confidence 3

Strengths

1). The writing and organization of this paper are very clear. 2). The paper provides theoretical support for the use of residual connections and normalization layers in GNNs.

Weaknesses

1). The standard deviations for MUTAG, PROTEINS, PTC-MR, and Cora in Table 1 are quite large, which makes the experimental results less convincing. Why are the standard deviations for GIN so large? Why does GraphNormv2 show a significant improvement for GIN on the ogbn-arxiv dataset, but not for GCN and GAT? 2). Could you add a curve that shows how the performance of the different baselines and GraphNormv2 changes as the number of GNN layers increases? 3). There is an error in line 118.

Reviewer 03·Rating 8·Confidence 3

Strengths

The presentation is very clear, with a nice balance of theoretical investigations coupled with empirical evaluations. The exposition is gradual, with relevant connections to existing literature. I really enjoyed reading this paper. The technical results are detailed and extensive. The authors conduct experiments on various real-world datasets, and the presented results display convincing performance.

Weaknesses

The theoretical investigations consider the simplified setting of linearized GNNs. As mentioned in the paper, I acknowledge that you already consider bridging this gap in future work. Would it be possible to provide some insights into what specific challenges you anticipate in extending the analysis to non-linear GNNs? You show that residual connections and normalization layers help against oversmoothing through different mechanisms. Do you think it would be possible to quantify which of

Taxonomy

Topics·Opportunistic and Delay-Tolerant Networks

Methods·Batch Normalization