The Final Layer Holds the Key: A Unified and Efficient GNN Calibration Framework

Jincheng Huang; Jie Xu; Xiaoshuang Shi; Ping Hu; Lei Feng; Xiaofeng Zhu

arXiv:2505.11335·cs.LG·September 30, 2025

Open Access·3 Reviews

TL;DR

This paper introduces a unified, efficient GNN calibration framework that improves confidence estimates through class-centroid-level and node-level calibration at the final layer, avoiding the additional calibration components that increase computational overhead in existing methods.

Contribution

It provides a theoretical framework linking confidence calibration to class-centroid and node-level adjustments, proposing a simple method to improve GNN calibration without extra components.

Findings

01. Reduces GNN under-confidence by lowering weight decay in the final layer

02. Node-level calibration improves confidence at a finer granularity

03. Method outperforms existing calibration techniques in experiments
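
The first finding ties under-confidence to weight decay on the final layer. One way to build intuition, offered here as a minimal sketch and not as the paper's analysis, is to note that shrinking the final-layer weights shrinks the logits, which pushes the softmax toward uniform. The weight matrix `W`, the node representation `h`, and the `scale` factor below are hypothetical values chosen for illustration; `scale` stands in for the shrinkage that stronger weight decay might cause.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(weights, features, scale=1.0):
    """Max softmax probability of a linear final layer whose weights are multiplied by `scale`."""
    logits = [scale * sum(w * x for w, x in zip(row, features)) for row in weights]
    return max(softmax(logits))


# Hypothetical 3-class final layer (one row per class) and one node representation.
W = [[2.0, -1.0], [0.5, 0.5], [-1.0, 1.5]]
h = [1.0, 0.2]

# Smaller scale = smaller weight norm; the predicted class stays the same,
# but the confidence assigned to it drops toward 1/3.
for scale in (1.0, 0.5, 0.1):
    print(f"scale={scale:.1f}  confidence={confidence(W, h, scale):.3f}")
```

The predicted class does not change under positive rescaling, so accuracy is unaffected in this toy setting while confidence falls, which is the pattern the finding describes as under-confidence.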

Abstract

Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness on graph-based tasks. However, their predictive confidence is often miscalibrated, typically exhibiting under-confidence, which harms the reliability of their decisions. Existing calibration methods for GNNs normally introduce additional calibration components, which fail to capture the intrinsic relationship between the model and the prediction confidence, resulting in limited theoretical guarantees and increased computational overhead. To address this issue, we propose a simple yet efficient graph calibration method. We establish a unified theoretical framework revealing that model confidence is jointly governed by class-centroid-level and node-level calibration at the final layer. Based on this insight, we theoretically show that reducing the weight decay of the final-layer parameters alleviates GNN…
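
Miscalibration of the kind the abstract describes is commonly quantified with the Expected Calibration Error (ECE), the metric the reviews below refer to. The following is a minimal sketch of the standard equal-width-bin ECE, plus a signed gap that is positive when a model is under-confident. The function names and the toy data are assumptions made for illustration and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: bin-size-weighted mean of |accuracy - confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


def confidence_gap(confidences, correct):
    """Overall accuracy minus mean confidence; positive means under-confident."""
    acc = sum(1 for h in correct if h) / len(correct)
    return acc - sum(confidences) / len(confidences)


# Toy data: the model is right 4 times out of 5 but only claims 60% confidence.
toy_conf = [0.6, 0.6, 0.6, 0.6, 0.6]
toy_correct = [True, True, True, True, False]

print("ECE:", expected_calibration_error(toy_conf, toy_correct))
print("gap:", confidence_gap(toy_conf, toy_correct))
```

In the toy data accuracy exceeds the stated confidence, so the gap is positive: an under-confident model in the sense used above.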

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 8·Confidence 3

Strengths

1. This is the paper's most significant strength. It moves beyond heuristic-based calibration by providing a rigorous theoretical analysis.
2. The proposed SCAR method consistently outperforms a wide range of strong baselines across multiple datasets.

Weaknesses

1. The node-level calibration is refined in Eq. 10 to account for the structural bias of GNNs (nodes closer to training data get more similar representations). While this is a thoughtful addition, its evaluation is limited. An ablation study showing the performance gain of using two parameters $\alpha$ and $\beta$ over a single one would have strengthened this claim.
2. The details of the high-order neighbors of the training nodes are not well specified.
3. Sensitivity analysis on hyper-paramete

Reviewer 02·Rating 4·Confidence 5

Strengths

1. The authors are the first to theoretically show that final-layer weight decay aggravates GNN under-confidence, and they mitigate this by reducing the decay.
2. They propose a training-free node-level calibration method as a fine-grained complement to class-centroid-level calibration.
3. They develop a unified theoretical framework showing that both calibration levels jointly govern model confidence, and validate the method's superiority across diverse settings.

Weaknesses

1. Missing important related work: Given that the paper focuses on confidence calibration, it is concerning that several key papers on uncertainty estimation or calibration for GNNs are not cited or discussed [1-4].
2. Limited baselines: The experimental comparisons would benefit from the inclusion of recent calibration methods [5].
3. Restricted backbone models: The authors only evaluate their method on GCN and GAT. While these are classical models, they are no longer sufficient to

Reviewer 03·Rating 6·Confidence 4

Strengths

- This paper provides a theoretical connection between the under-confidence of GNNs and the final layer's weight decay, which is valuable given the lack of theoretical analysis in the GNN calibration literature.
- The proposed method is simple yet effective, avoiding the need to train additional calibration networks as required by many existing methods.
- Extensive experiments show that SCAR substantially reduces ECE compared to prior baselines, while maintaining original classification accuracy of GNN

Weaknesses

- The proposed node-level calibration assumes that pushing test nodes toward their predicted class centroids improves confidence, which may not hold under settings such as out-of-distribution (OOD) conditions. For instance, in OOD graphs, pushing test nodes toward centroids learned from training data can degrade calibration.
- If the original GNNs are trained with zero weight decay, the proposed method may be partially inapplicable.
- While SCAR is efficient, it needs to search the optimal confi

Taxonomy

Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Machine Learning in Healthcare

Methods: Weight Decay