Never Skip a Batch: Continuous Training of Temporal GNNs via Adaptive Pseudo-Supervision
Alexander Panyshev, Dmitry Vinichenko, Oleg Travkin, Roman Alferov, Alexey Zaytsev

TL;DR
This paper introduces HAL, a method that accelerates training of Temporal Graph Networks by using pseudo-labels from historical data, reducing gradient variance and enabling continuous updates, validated on TGB with up to 15x speedup.
Contribution
The paper presents HAL, a novel approach that leverages historical pseudo-labels to improve training efficiency of temporal GNNs without architectural changes.
Findings
HAL accelerates TGNv2 training by up to 15x.
Using pseudo-labels reduces gradient variance and speeds convergence.
HAL maintains competitive performance while improving training speed.
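The variance claim above can be made concrete with a toy experiment. The sketch below is not the paper's analysis: it only illustrates the generic statistical fact that a target averaged over several noisy observations fluctuates less than a single observation. The function name `target_variance`, the Gaussian noise model, and the window size of 8 are all assumptions made for illustration.

```python
import random
import statistics


def target_variance(window, trials=20000, noise=1.0, seed=0):
    """Empirical variance of a target formed by averaging `window` noisy observations.

    Hypothetical toy model: each observation is the true value (0.0) plus
    independent Gaussian noise with standard deviation `noise`.
    """
    rng = random.Random(seed)
    targets = []
    for _ in range(trials):
        observations = [rng.gauss(0.0, noise) for _ in range(window)]
        targets.append(sum(observations) / window)
    return statistics.pvariance(targets)


# A single noisy label versus an average over 8 historical labels.
single = target_variance(window=1)
averaged = target_variance(window=8)
print(f"single-label variance:   {single:.3f}")
print(f"averaged-label variance: {averaged:.3f}")
```

Under this independence assumption the averaged target's variance shrinks roughly in proportion to the window size. Real label histories are correlated and drift over time, so the gain in practice depends on how slowly the underlying preferences change.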
Abstract
Temporal Graph Networks (TGNs), while being accurate, face significant training inefficiencies due to irregular supervision signals in dynamic graphs, which induce sparse gradient updates. We first theoretically establish that aggregating historical node interactions into pseudo-labels reduces gradient variance, accelerating convergence. Building on this analysis, we propose History-Averaged Labels (HAL), a method that dynamically enriches training batches with pseudo-targets derived from historical label distributions. HAL ensures continuous parameter updates without architectural modifications by converting idle computation into productive learning steps. Experiments on the Temporal Graph Benchmark (TGB) validate our findings and an assumption about slow change of user preferences: HAL accelerates TGNv2 training by up to 15x while maintaining competitive performance. Thus, this work…
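As a rough illustration of the idea in the abstract, the sketch below keeps a per-node running average of observed label vectors and uses it as a pseudo-target when a node appears in a batch without a ground-truth label. It is a minimal sketch under stated assumptions, not the authors' implementation: the class `HistoryAveragedLabels`, the helper `build_targets`, the `decay` parameter, and the exponential-moving-average form (mentioned in the reviews) are all hypothetical names and choices made here.

```python
class HistoryAveragedLabels:
    """Per-node exponential moving average of past label vectors (illustrative only)."""

    def __init__(self, num_classes, decay=0.9):
        self.num_classes = num_classes
        self.decay = decay      # assumed smoothing factor; not taken from the paper
        self.history = {}       # node id -> running average of observed labels

    def update(self, node, label):
        """Fold a newly observed ground-truth label vector into the node's history."""
        if len(label) != self.num_classes:
            raise ValueError("label length must equal num_classes")
        prev = self.history.get(node)
        if prev is None:
            # First observation: the average is the label itself.
            self.history[node] = list(label)
        else:
            d = self.decay
            self.history[node] = [d * p + (1 - d) * x for p, x in zip(prev, label)]

    def pseudo_target(self, node):
        """Return a copy of the stored average, or None if the node has no history."""
        avg = self.history.get(node)
        return None if avg is None else list(avg)


def build_targets(batch_nodes, true_labels, hal):
    """Map each batch node to (target, source), preferring real labels over pseudo-targets."""
    targets = {}
    for node in batch_nodes:
        if node in true_labels:
            targets[node] = (list(true_labels[node]), "label")
        else:
            pseudo = hal.pseudo_target(node)
            if pseudo is not None:
                targets[node] = (pseudo, "pseudo")
    # Update the history only after targets are built, so a label never
    # contributes to a pseudo-target used in the same batch.
    for node, label in true_labels.items():
        hal.update(node, label)
    return targets


hal = HistoryAveragedLabels(num_classes=2, decay=0.5)
hal.update("u", [1.0, 0.0])
hal.update("u", [0.0, 1.0])
unlabeled_batch = build_targets(["u", "v"], {}, hal)
labeled_batch = build_targets(["u"], {"u": [1.0, 0.0]}, hal)
```

In this sketch, node `"v"` has no history and therefore receives no target, which is one simple way to handle the early-timestep case; the paper may initialize pseudo-labels differently.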
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper proposes a simple but effective pseudo-labeling method based on exponential moving averages of past labels. This approach reduces label sparsity and allows continuous training on temporal graphs.
2. The authors provide a clear theoretical analysis proving faster SGD convergence under historical label aggregation, with a quantified improvement factor of min(h, k). This adds rigor and supports the method’s validity.
3. The approach is implemented on both TGNv2 and a modified DyRep v
1. In Section 2.2, the paper states: “For each batch Bt we compute pseudo-targets only for nodes v participating in Bt.” It is unclear what “unlabeled” means for these nodes — are they naturally without supervision at this timestep, or is this due to missing ground truth? If it is the former, using historical pseudo-labels might distort the temporal dynamics of infrequent or slowly changing nodes, whose past labels may no longer represent their current state. Could this affect model stability or
1. Practical and timely problem. Temporal GNNs indeed suffer from sparse supervision, where many batches lack labels. Addressing this inefficiency is valuable for real-world streaming systems.
2. Implementation simplicity. The proposed pseudo-labeling (HA/MA/PF) is easy to integrate into existing TGN pipelines, which enhances reproducibility.
3. Initial theoretical analysis. The paper attempts to formalize the benefit of pseudo-labels through reduced gradient variance, offering a conceptual link
1. Limited novelty. The contribution lies mainly in adding pseudo-label updates to existing temporal GNNs (TGNv2, DyRep). No new architecture, optimization mechanism, or learning paradigm is introduced. The proposed pseudo-labeling methods are also somewhat similar to the “Moving Average” method mentioned in the paper “Temporal Graph Benchmark for Machine Learning on Temporal Graphs.”
2. Pseudo-label initialization unclear. When there is insufficient history (early timesteps), how are pseudo-labels initial
1) The paper introduces a simple yet novel approach to handle sparse supervision in temporal GNNs via History-Averaged Labels, enabling continuous training even on unlabeled batches.
2) It adapts historical averaging concepts from time-series forecasting to pseudo-labeling in dynamic graphs.
3) The method is easy to integrate into existing models without architectural changes.
4) Well-written, logically organized, and supported by informative figures and ablation studies.
5) Addresses an imp
1) The experiments are restricted to only two architectures of similar type (TGNv2 and DyRepv2), which limits the evidence for generality. Broader testing across diverse temporal GNN frameworks would better support the “architecture-agnostic” claim.
2) The fact that the method achieves strong performance using only 5% of the training data is encouraging and highlights its data efficiency. However, the presentation of this result is somewhat confusing and potentially misleading, as it is framed
Taxonomy
Topics: Advanced Graph Neural Networks · Machine Learning in Healthcare · Innovative Human-Technology Interaction
