Training ML Models with Predictable Failures

Will Schwarzer; Scott Niekum

arXiv:2605.15134·cs.LG·May 18, 2026

TL;DR

This paper analyzes the forecast error of an existing estimator (Jones et al., 2025) that predicts deployment-scale failure rates by extrapolating from the largest failure scores in an evaluation set, and it proposes a fine-tuning objective, the forecastability loss, to make those predictions more reliable.

Contribution

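It gives a finite-k decomposition of the estimator's forecast error, shows that the estimator is biased toward over-prediction in the typical case but can under-predict when the evaluation set misses a rare high-failure mode, and proposes the forecastability loss to address that failure mode.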

Findings

01

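Fine-tuning with the forecastability loss reduces forecast error in the two proof-of-concept experiments (a language-model password game and an RL gridworld).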

02

The method maintains primary-task performance.

03

Deployment-scale failure-rate predictions improve with the proposed loss.

Abstract

Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld,…
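The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea of extrapolating from the k largest evaluation scores. It assumes an exponential upper tail above the (k+1)-th largest score, which is a common peaks-over-threshold simplification and not necessarily the form used by Jones et al. (2025) or analyzed in this paper. All function names and the synthetic scores are hypothetical.

```python
import math


def fit_exponential_tail(scores, k):
    """Fit an exponential tail to the k largest scores (an assumed tail model).

    Returns (threshold, mean_excess), where threshold is the (k+1)-th
    largest score and mean_excess is the average amount by which the
    top-k scores exceed it.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ordered = sorted(scores, reverse=True)
    threshold = ordered[k]
    mean_excess = sum(s - threshold for s in ordered[:k]) / k
    return threshold, mean_excess


def forecast_exceedance_prob(scores, k, failure_level):
    """Estimate P(score > failure_level) for a single query."""
    m = len(scores)
    threshold, mean_excess = fit_exponential_tail(scores, k)
    if failure_level <= threshold:
        # Inside the well-observed range: use the empirical fraction.
        return sum(s > failure_level for s in scores) / m
    if mean_excess == 0:
        return 0.0
    # Extrapolate beyond the data: P(X > u) * P(X > t | X > u).
    return (k / m) * math.exp(-(failure_level - threshold) / mean_excess)


def prob_any_failure(p, n):
    """Probability of at least one failure in n independent queries."""
    if p >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p))


# Synthetic evaluation set: deterministic quantiles of an Exp(1) distribution.
m = 1000
eval_scores = [-math.log(1 - (i + 0.5) / m) for i in range(m)]

# A failure level far beyond anything in the evaluation set.
level = math.log(1e5)
p_hat = forecast_exceedance_prob(eval_scores, k=50, failure_level=level)
print(f"per-query failure estimate: {p_hat:.2e}")
print(f"P(any failure in 100000 queries): {prob_any_failure(p_hat, 100_000):.3f}")
```

In this toy setup the evaluation set contains no score above the failure level, yet the tail fit still yields a nonzero per-query estimate. The sketch also illustrates the weakness the abstract describes: if deployment contains a rare high-failure mode that contributes nothing to the top-k evaluation scores, the fitted tail cannot account for it and the forecast under-predicts.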
