Learning from Historical Activations in Graph Neural Networks

Yaniv Galron; Hadar Sinai; Haggai Maron; Moshe Eliasof

arXiv:2601.01123 · cs.LG · May 19, 2026

TL;DR

HISTOGRAPH is a novel attention-based pooling method for GNNs that leverages intermediate layer activations to improve graph classification, especially in deep architectures.

Contribution

Introduces HISTOGRAPH, a two-stage attention mechanism that utilizes historical activations across layers for enhanced graph pooling.
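To make the two-stage idea concrete, here is a minimal NumPy sketch of pooling over stored layer activations: a layer-wise attention step per node, followed by a node-wise attention step and a mean readout. This is not the authors' implementation. The summary does not give the exact formulation, so the function name `two_stage_attention_pool`, the use of the last layer as the query, the projection matrices `w_q` and `w_k`, and the final mean are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def two_stage_attention_pool(hist, w_q, w_k):
    """Pool historical activations into one graph descriptor (hypothetical sketch).

    hist: array of shape (L, N, D) holding the activations of N nodes
          at each of L layers.
    w_q, w_k: (D, D) projection matrices (assumed, not from the paper).
    Returns the graph descriptor (D,) and the per-node layer weights (N, L).
    """
    num_layers, num_nodes, dim = hist.shape

    # Stage 1 (layer-wise): each node attends over its own history.
    # Assumption: the last-layer activation serves as the query.
    q = hist[-1] @ w_q                                      # (N, D)
    k = hist @ w_k                                          # (L, N, D)
    scores = np.einsum("nd,lnd->nl", q, k) / np.sqrt(dim)   # (N, L)
    layer_w = softmax(scores, axis=1)                       # rows sum to 1
    z = np.einsum("nl,lnd->nd", layer_w, hist)              # (N, D)

    # Stage 2 (node-wise): dense self-attention across nodes, then mean readout.
    node_scores = (z @ z.T) / np.sqrt(dim)                  # (N, N)
    node_w = softmax(node_scores, axis=1)
    z_mixed = node_w @ z                                    # (N, D)
    return z_mixed.mean(axis=0), layer_w
```

Calling it with `hist` of shape `(4, 5, 8)` and identity projections returns a descriptor of shape `(8,)` and layer weights of shape `(5, 4)`. In a trained model the projections would be learned; here they are plain inputs so the data flow stays visible.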

Findings

01. Consistently improves performance over traditional pooling methods.
02. Provides robustness in deep GNN architectures.
03. Enhances node representation by modeling activation evolution.

Abstract

Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce…
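The over-smoothing effect mentioned in the abstract can be illustrated with a toy example. The sketch below is a simplification and not the paper's model: it applies only row-normalized neighborhood averaging (no learned weights, no nonlinearities) on a hypothetical 5-node cycle graph, and keeps every intermediate state as the "history". The helper names `propagate` and `node_spread` are invented for this illustration.

```python
import numpy as np


def propagate(features, adjacency, num_layers):
    """Repeatedly average each node with its neighbors; keep every state."""
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    p = a_hat / a_hat.sum(axis=1, keepdims=True)     # row-normalize
    history = [features]
    for _ in range(num_layers):
        history.append(p @ history[-1])
    return np.stack(history)                         # (num_layers + 1, N, D)


def node_spread(x):
    """Largest per-feature standard deviation across nodes."""
    return float(x.std(axis=0).max())


# Hypothetical toy graph: an undirected cycle on 5 nodes.
n = 5
adjacency = np.zeros((n, n))
for i in range(n):
    adjacency[i, (i + 1) % n] = 1.0
    adjacency[(i + 1) % n, i] = 1.0

features = np.random.default_rng(0).normal(size=(n, 3))
history = propagate(features, adjacency, num_layers=30)
spreads = [node_spread(h) for h in history]
```

In this toy setting `spreads` shrinks toward zero as depth grows, so the last state alone carries almost no node-specific signal, while the earlier entries of `history` still do. That is the kind of information a last-layer-only readout would discard.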

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Novel perspective: The paper introduces a clear and well-motivated idea of learning from the historical trajectory of node activations, addressing the common limitation of relying solely on the last GNN layer.
2. Comprehensive experiments: Evaluations across multiple datasets (TU, OGB, node classification, and link prediction) with both GIN and GCN backbones show consistent improvements.
3. Well-written and well-positioned: The paper situates HISTOGRAPH clearly within prior works on pooling

Weaknesses

1. Limited interpretability of learned attention weights: While attention is used layer-wise and node-wise, the paper could benefit from deeper analysis of what the model learns (e.g., visualization of layer weights across datasets).
2. The attention mechanism itself is widely adopted and not novel. However, the paper should further clarify why the proposed method achieves such notable performance gains. A deeper analytical discussion and illustrative case studies would substantially strengthen th

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- It is clear that the authors have spent a lot of effort in the experimental section as they compare against a large number of baselines and consider a large number of datasets.
- The proposed method can be easily included into existing architectures (at the cost of some training for the new parameters).

Weaknesses

- Global self-attention is quadratic in the number of nodes, which makes the method impractical for large graphs.
- Caching in memory the activations at all layers for all nodes can become prohibitively expensive. Together with the above, this makes the proposed method very impractical for large graphs.
- Section 4 is not very convincing as the arguments are too general. Regarding oversmoothing, Proposition 1 is obvious, and in practice different nodes might perform better with different alphas
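The scaling concern in the first two points can be made concrete with back-of-envelope arithmetic. The sketch below assumes float32 storage (4 bytes per value), a dense N × N attention matrix, and a full cache of all layer activations. The graph size, depth, and hidden width are hypothetical numbers chosen for illustration, not figures from the paper or the review.

```python
def history_cache_bytes(num_nodes, num_layers, hidden_dim, bytes_per_value=4):
    """Memory needed to keep every node's activation at every layer."""
    return num_nodes * num_layers * hidden_dim * bytes_per_value


def dense_attention_bytes(num_nodes, bytes_per_value=4):
    """Memory for one dense N x N attention matrix (quadratic in nodes)."""
    return num_nodes * num_nodes * bytes_per_value


# Hypothetical setting: 100,000 nodes, 16 layers, hidden width 128.
cache = history_cache_bytes(100_000, 16, 128)   # 819,200,000 bytes (about 0.8 GB)
attn = dense_attention_bytes(100_000)           # 40,000,000,000 bytes (40 GB)
```

Under these assumptions the activation cache grows linearly with depth and node count, while the dense attention matrix grows quadratically with node count and dominates well before the cache does.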

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The motivation is clear; the authors propose leveraging historical representations to mitigate over-smoothing, which is reasonable and well-justified.
2. The experiments are comprehensive, thoroughly validating the effectiveness of their method across various tasks.

Weaknesses

1. Lacks comparison with some more recent baselines [1].
2. No experimental comparisons were conducted on larger graphs, such as those in the OGB [2] suite. How does the time efficiency compare to the baseline when the graph size increases?
3. How are the historical representations specifically utilized? What are the theoretical advantages of the gating mechanism?
4. Lacks a theoretical analysis of the method's effectiveness.

[1] Wang Y, Liu S, Zheng T, et al. Unveiling global interactive patte

Videos

Learning from Historical Activations in Graph Neural Networks · SlidesLive

Taxonomy

Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Machine Learning in Healthcare