Selective Prediction via Training Dynamics

Stephan Rabanser; Anvith Thudi; Kimia Hamidieh; Adam Dziedzic; Israfil Bahceci; Akram Bin Sediq; Hamza Sokun; Nicolas Papernot

arXiv:2205.13532 · cs.LG · July 8, 2025 · 1 citation

TL;DR

This paper introduces a framework for selective prediction based on training dynamics. It improves the trade-off between coverage and utility without modifying the model architecture or training objective, and it applies across several domains.

Contribution

The authors propose a novel, domain-agnostic method using training dynamics to enhance selective prediction performance without altering existing training procedures.

Findings

01  Outperforms state-of-the-art methods on image classification benchmarks.

02  Effective across classification, regression, and time series tasks.

03  Does not require changes to model architecture or training process.

Abstract

Selective Prediction is the task of rejecting inputs a model would predict incorrectly on. This involves a trade-off between input space coverage (how many data points are accepted) and model utility (how good is the performance on accepted data points). Current methods for selective prediction typically impose constraints on either the model architecture or the optimization objective; this inhibits their usage in practice and introduces unknown interactions with pre-existing loss functions. In contrast to prior work, we show that state-of-the-art selective prediction performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, given a test input, monitors metrics capturing the instability of predictions from intermediate models (i.e., checkpoints) obtained during training w.r.t. the final model's prediction. In…
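The idea in the abstract can be illustrated with a minimal sketch. It assumes one simple instability metric, the weighted fraction of checkpoints whose predicted label differs from the final model's label. The function names, the optional `weights` argument, and the threshold `tau` are hypothetical and chosen for illustration; the paper's exact metrics may differ.

```python
import numpy as np

def disagreement_score(checkpoint_preds, final_preds, weights=None):
    """Weighted fraction of checkpoints that disagree with the final model.

    checkpoint_preds: (T, N) labels predicted by T intermediate checkpoints.
    final_preds:      (N,) labels predicted by the final model.
    weights:          optional (T,) non-negative weights (an assumption here,
                      e.g. to emphasize late checkpoints); uniform by default.
    Returns an (N,) array in [0, 1]; higher means less stable.
    """
    checkpoint_preds = np.asarray(checkpoint_preds)
    final_preds = np.asarray(final_preds)
    num_checkpoints = checkpoint_preds.shape[0]
    if weights is None:
        weights = np.ones(num_checkpoints)
    weights = np.asarray(weights, dtype=float)
    disagree = (checkpoint_preds != final_preds[None, :]).astype(float)
    return weights @ disagree / weights.sum()

def select(scores, tau):
    """Accept inputs whose instability score is at most tau."""
    return np.asarray(scores) <= tau

def coverage_and_accuracy(accept, final_preds, labels):
    """Coverage = fraction accepted; accuracy is measured on accepted points only."""
    accept = np.asarray(accept, dtype=bool)
    final_preds = np.asarray(final_preds)
    labels = np.asarray(labels)
    coverage = float(accept.mean())
    if not accept.any():
        return coverage, float("nan")
    accuracy = float((final_preds[accept] == labels[accept]).mean())
    return coverage, accuracy

# Toy example: 3 checkpoints, 4 test inputs.
ckpt = [[0, 1, 2, 0],
        [0, 1, 1, 0],
        [0, 2, 1, 0]]
final = np.array([0, 1, 1, 1])
labels = np.array([0, 1, 1, 0])

scores = disagreement_score(ckpt, final)
accepted = select(scores, tau=0.5)
print(scores, accepted, coverage_and_accuracy(accepted, final, labels))
```

In this toy example the fourth input, on which every checkpoint disagrees with the final model, is rejected, and it is also the one the final model gets wrong. Lowering `tau` reduces coverage and typically raises accuracy on the accepted points.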

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

Strengths include:
1. The method works for several tasks, including classification, regression, and time series.
2. The method seems to outperform the previous state-of-the-art selective classification methods.
3. Several experimental results are presented.

Weaknesses

Weaknesses include:
1. The proposed idea lacks novelty, as it is very similar to using ensembles of models. The difference here is that the ensembles are generated on a fixed schedule from the training dynamics.
2. Checkpoints are chosen on a fixed schedule, which can correspond to models of bad performance. A better approach is to follow [Huang et al. 2017], which constructs an ensemble by choosing points of good performance using a cyclic learning rate. Us

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

I find the paper well written, clearly presenting each relevant concept and experiment. The method is simple, which facilitates its adoption by ML practitioners. The experiments are convincing.

Weaknesses

- The novelty of the method is limited; the idea of re-using past checkpoints to form an ensemble can be found in e.g. [1].
- The results for SPTD and Deep Ensemble (DE) are both relatively close to one another, and it would be nice to derive conditions under which one method is expected to be better than the other.
- It is unclear how the performance of SPTD is tied to optimization noise. Especially, regression experiments use full-batch gradient descent, how would the results evolve when usin

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

Looking at the training dynamics to gauge prediction reliability at a test point is a refreshingly interesting idea. Despite its simple formulation, I consider the idea novel; in fact, simplicity of implementation is a plus to me. The paper is also reasonably well written, and it is a pleasure to read. All discussion points and experiment highlights are well organized, which makes the core idea very digestible. I also appreciate the extensive results with many ablation studies.

Weaknesses

Despite the above strengths, I still have a few doubts regarding the practicality of this paper: First, the results are presented in a way that gives the impression that one can control the coverage. How is it possible in practice? I understand that the threshold can be adjusted to meet a certain coverage level on the training set but I am not sure how we could do that for the unseen test set. In other words, I feel that setting tau algorithmically should be part of the solution. Second,
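For context on the coverage question raised above, one common recipe is to calibrate the threshold on held-out data rather than on the training set. This is an assumption for illustration, not something the review or the paper excerpt confirms. The sketch below picks tau as a quantile of instability scores on a hypothetical validation set, assuming lower scores mean more reliable predictions and that test scores are distributed like validation scores.

```python
import numpy as np

def calibrate_tau(val_scores, target_coverage):
    """Pick tau so that roughly `target_coverage` of validation scores are <= tau.

    val_scores:      instability scores on held-out inputs (lower = more stable).
    target_coverage: desired fraction of accepted inputs, in (0, 1].
    """
    val_scores = np.asarray(val_scores, dtype=float)
    if not 0.0 < target_coverage <= 1.0:
        raise ValueError("target_coverage must be in (0, 1]")
    return float(np.quantile(val_scores, target_coverage))

# Hypothetical validation scores, for illustration only.
val_scores = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
tau = calibrate_tau(val_scores, target_coverage=0.8)
realized = float(np.mean(val_scores <= tau))
print(tau, realized)
```

The coverage realized on unseen test data matches the target only approximately, and with heavily tied scores (for example, when few checkpoints are used) the quantile rule can overshoot the target.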

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)