Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration
Piotr Kubaty, Filip Szatkowski, Metod Jazbec, Bartosz Wójcik

TL;DR
This paper argues that failure prediction is a more reliable indicator than calibration for assessing early-exit networks' performance, as it better correlates with efficiency improvements.
Contribution
The authors demonstrate that calibration metrics can be misleading and propose failure prediction as a superior proxy for early-exit model performance evaluation.
Findings
Miscalibrated networks can outperform calibrated ones.
Failure prediction correlates strongly with efficiency gains.
Calibration metrics may not reflect true model performance.
Abstract
Early-exit models accelerate inference by attaching internal classifiers to intermediate layers of the network, allowing computation to halt once a prediction meets a predefined exit criterion. Most early-exit methods rely on confidence-based exit strategies, which has motivated prior work to calibrate intermediate classifiers in pursuit of improved performance-efficiency trade-offs. In this paper, we argue that calibration metrics can be misleading indicators of multi-exit model performance. Specifically, we present empirical evidence showing that miscalibrated networks can outperform calibrated ones. As an alternative, we propose using failure prediction as a more informative proxy for early-exit model performance. Unlike calibration, failure prediction captures changes in sample rankings and correlates strongly with efficiency gains, offering a more reliable framework for designing…
| Model | FLOPs threshold 25% | FLOPs threshold 50% | FLOPs threshold 75% | ECE | EEFP |
|---|---|---|---|---|---|
| MSDNet - CIFAR-100 | | | | | |
| Baseline | | | | 0.21 | 0.82 |
| Calibrated | | | | 0.02 | 0.83 |
| Ours | 73.27 | 75.66 | 75.58 | 0.15 | 0.86 |
| ViT Small - ImageNet1000 | | | | | |
| Baseline | | | | 0.16 | 0.81 |
| Calibrated | | | | 0.02 | 0.82 |
| Ours | 40.18 | 72.45 | 80.41 | 0.15 | 0.84 |
| EfficientNet B2 - ImageNet1000 | | | | | |
| Baseline | | | | 0.18 | 0.74 |
| Calibrated | | | | 0.02 | 0.76 |
| Ours | 34.50 | 63.46 | 77.57 | 0.15 | 0.76 |
| Variant | FLOPs threshold 25% | FLOPs threshold 50% | FLOPs threshold 75% | EEFP |
|---|---|---|---|---|
| top-1 | | | | 0.74 |
| top-2 | | | | 0.78 |
| top-5 | 50.67 | 64.06 | 67.03 | 0.78 |
| top-10 | 50.67 | | | 0.78 |
| top-200 | | | | 0.76 |
| Method | FLOPs threshold 25% | FLOPs threshold 50% | FLOPs threshold 75% | EEFP |
|---|---|---|---|---|
| Baseline | | | | 0.73 |
| Calibration | | | | 0.74 |
| W/o History | 50.82 | | | 0.77 |
| With History | | 64.06 | 67.03 | 0.78 |
| Model | FLOPs |
|---|---|
| Baseline EENN | 4 653 372 423 |
| Ours | 4 653 433 635 |
| | Class 1 | Class 2 | Class 3 | Class 4 |
|---|---|---|---|---|
| Image | -0.7985 | -0.9163 | -2.3026 | -2.9957 |
| Image | -1.6094 | -1.6094 | -0.9163 | -1.6094 |
| Image | -1.2040 | -0.9676 | -1.3471 | -2.8134 |
| | Class 1 | Class 2 | Class 3 | Class 4 |
|---|---|---|---|---|
| Image | 0.450 | 0.400 | 0.100 | 0.050 |
| Image | 0.200 | 0.200 | 0.400 | 0.200 |
| Image | 0.300 | 0.380 | 0.260 | 0.060 |
Rethinking Calibration for Early-Exit Neural Networks
Piotr Kubaty
Filip Szatkowski
Grzegorz Choczyński
Eric Nalisnick
Bartosz Wójcik
Abstract
Early-exit neural networks (EENNs) accelerate inference by allowing intermediate classifiers to stop computation once predictions are confident enough. Most methods rely on confidence thresholds for exiting, and consequently, improving classifier calibration is widely assumed to improve performance. In this work, we challenge this assumption and show that calibration alone is not sufficient for EENNs to exploit adaptive computation. To address this insufficiency, we introduce Early-Exit Failure Prediction (EEFP), which accounts for both prediction correctness and the cost of further computation. We also propose a lightweight, EEFP-motivated procedure to improve the intermediate classifiers, which can directly replace calibration in EENNs. Extensive experiments demonstrate that our approach achieves superior cost-accuracy trade-offs compared to calibration, and EEFP more reliably reflects overall EENN performance. Our code is available at https://github.com/gmum/rethinking-calibration-for-eenns.
Early-Exits, Calibration, Failure Prediction, Computer Vision, Machine Learning
1 Introduction
The rapid growth of deep learning has led to an increasing demand for resource-efficient models. Early-exit neural networks (EENNs) address this challenge by allowing a model to dynamically process input samples and assign less computation to easy samples, thereby reducing the average inference cost. EENNs enable inference to terminate early by attaching auxiliary classifiers to intermediate layers and halting computation once the prediction is thought to be either correct or sufficiently good as to not make further computation worthwhile. Early-exit mechanisms were originally introduced in the context of vision models (Lee et al., 2015; Larsson et al., 2017; Panda et al., 2016; Teerapittayanon et al., 2016; Huang et al., 2018; Kaya et al., 2019), and have since emerged as a natural solution for resource-constrained scenarios (Wang et al., 2019; Tambe et al., 2021; Fang et al., 2020; Ghodrati et al., 2021; Laskaridis et al., 2020; Shao et al., 2021; Li et al., 2020; Jazbec et al., 2023). More recently, early-exit architectures have been successfully adopted in natural language processing (Zhou et al., 2020; Liao et al., 2021; Schuster et al., 2022; Wójcik et al., 2023; Jazbec et al., 2024b), including reasoning models (Yang et al., 2025; Jiang et al., 2025).
The most common early-exit strategy relies on prediction confidence, allowing the network to stop inference when an intermediate classifier produces a prediction whose confidence exceeds a predefined threshold. This introduces a trade-off between prediction quality and computational cost, where higher thresholds generally improve the network accuracy at the expense of increased computation. Naturally, ensuring that exit decisions are correct improves the cost-accuracy trade-off inherent to such dynamic inference systems. As a result, a substantial body of prior work has equated reliable exit decisions with well-calibrated confidence estimates, and in turn, has focused on improving the calibration of EENN classifiers under the assumption that better calibration improves overall performance (Qendro et al., 2021; Lilli et al., 2021; Pacheco et al., 2023; Meronen et al., 2024; Mofakhami et al., 2024; Hao et al., 2026).
However, in this work, we demonstrate that this assumption is not entirely correct. We provide a theoretical investigation of calibration in EENNs and show how the motivations for employing EENNs are at odds with the standard notions of calibration. In particular, calibration of individual exits fails to account for the sequential structure of early-exit classifiers. We show that calibration alone—without additional assumptions relating to exit ordering—cannot reliably improve EENN performance.
As an alternative to calibration, we take inspiration from failure prediction (Corbière et al., 2019) and propose Early-Exit Failure Prediction (EEFP). Crucially, unlike a calibrated exit, an EEFP-aligned exit is structure aware: it considers not only the correctness of the current classifier but also the potential correctness of the subsequent classifiers. We introduce a lightweight EEFP-based confidence-correction procedure that serves as an effective alternative to exiting via classifier confidence. Figure 1 conceptually compares our method to confidence-based exiting.
Across a range of standard benchmarks, we demonstrate that networks corrected with our approach consistently achieve better cost–accuracy characteristics than their confidence-calibrated counterparts. Importantly, we also show how our EEFP score reliably captures differences in quality between different EENNs: networks whose classifiers achieve higher EEFP scores perform better, while calibration metrics fail to reflect such differences.
Our contributions are as follows:
- We provide a theoretical analysis showing that standard notions of calibration, developed for isolated classifiers, are insufficient for EENNs.
- We address the limitations of calibration by adapting failure prediction for EENNs and introducing the Early-Exit Failure Prediction (EEFP) score.
- We design an EEFP-aligned confidence-correction procedure that improves cost-accuracy trade-offs in EENNs across multiple benchmarks.
2 Background
In this section, we give the necessary background information on early-exit neural networks (EENNs) (2.1) and classifier calibration (2.2).
2.1 Early-Exit Neural Networks
EENNs extend standard neural networks, endowing them with intermediate classifiers that perform sequential predictions during inference. They can be represented as a sequence of models $p_1, \ldots, p_T$, each of which maps from a multivariate feature space $\mathcal{X}$ to a $K$-dimensional simplex: $p_t : \mathcal{X} \to \Delta^{K}$. We assume a $K$-class classification problem with labels taking the form of class indices and denoted $y \in \{1, \ldots, K\}$. The model’s distribution over class labels at step $t$ is then $p_t(\cdot \mid x)$, with $p_t(y \mid x)$ denoting the classifier’s confidence assigned to label $y$. Assume the data is generated from unknown distributions $\mathbb{P}(x)$ and $\mathbb{P}(y \mid x)$, which we can infer only through i.i.d. samples.
Assume that each sub-model of an EENN has an associated computational cost $c_t$ s.t. $c_1 < c_2 < \ldots < c_T$, and this overhead is constant across exits (i.e. roughly uniform parameter count between exits): $c_{t+1} - c_t = c$ for all $t$. A very common method for determining when to exit in EENNs is via confidence thresholding: the user must specify a threshold $\tau \in [0, 1]$, and the network will terminate at the $t$-th step if $\max_y p_t(y \mid x) \geq \tau$. The choice of $\tau$ represents the user’s preferences for the tradeoff between computation and accuracy: lower $\tau$ reduces the computation at the cost of accuracy, while higher $\tau$ increases the accuracy and compute.
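Confidence thresholding, i.e. exiting at the first classifier whose largest class confidence reaches the threshold, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the name `early_exit_predict`, the representation of each exit as a callable returning a list of class probabilities, and the rule that the final exit always answers are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: an exit maps an input to a list of class probabilities.
Exit = Callable[[object], List[float]]


def early_exit_predict(exits: Sequence[Exit], x: object, tau: float) -> Tuple[int, int]:
    """Return (exit index starting at 1, predicted class) under confidence thresholding."""
    if not exits:
        raise ValueError("need at least one exit")
    for t, exit_fn in enumerate(exits, start=1):
        probs = exit_fn(x)
        confidence = max(probs)
        # Stop at the first exit whose top confidence reaches the threshold.
        # Assumed fallback: the last exit always produces the answer.
        if confidence >= tau or t == len(exits):
            return t, probs.index(confidence)
    raise AssertionError("unreachable")


# Toy exits with fixed outputs, purely for illustration.
toy_exits = [
    lambda x: [0.5, 0.3, 0.2],
    lambda x: [0.7, 0.2, 0.1],
    lambda x: [0.1, 0.9, 0.0],
]
```

With these toy exits, a low threshold stops at the first classifier, while a threshold that no intermediate exit reaches forces the input through the whole network.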
2.2 Calibration
Assume for now we are in the simpler case of having only one classifier $p$, defined just as the EENN component classifiers above. Let $g : \mathcal{X} \to \mathcal{G}$ be a grouping function that partitions the feature space by assigning each point to a group indexed by $g(x)$. Thus the equivalence classes of $g$ are simply the points in feature space that are assigned to the same group: $[x]_g = \{x' \in \mathcal{X} : g(x') = g(x)\}$. We can now define a general notion of calibration (Dawid, 1982):
Definition 2.1.
Calibration: A predictor $p$ is (first-order) calibrated w.r.t. a choice of $g$ if
$$p(y \mid x) = \mathbb{E}\left[\mathbb{P}(y \mid x') \mid g(x') = g(x)\right] \quad \forall (x, y).$$
This means that $p$ is calibrated if the confidence it assigns to every $(x, y)$ pair equals the average probability of that label within the group to which $x$ is assigned. The choice of $g$ naturally gives rise to stronger and weaker forms of calibration (Vaicenavicius et al., 2019), and properly choosing $g$ is known as the reference class problem (Hájek, 2007).
The strongest form of calibration is canonical calibration, which is when the predictor is actually the Bayes predictor, having matched all the underlying conditional probabilities.
Definition 2.2.
Canonical Calibration: A predictor $p$ is canonically calibrated if it is calibrated w.r.t. the identity function $g(x) = x$: $p(y \mid x) = \mathbb{P}(y \mid x)$.
Canonical calibration is ‘strong’ because the grouping function is ‘fine grained’, creating a unique group for every point in feature space. This strength propagates into downstream decision-making: making per-instance decisions by thresholding the predictor (e.g. is $p(y \mid x) \geq \tau$?) is optimal for controlling Bayes risk since one is thresholding based on the actual conditional probabilities. On the other hand, we can create a weak notion of calibration by considering a ‘coarse grained’ grouping function. For example, consider a binary classification problem with balanced classes. The prediction $p(y \mid x) = 0.5$ for all $x$ is marginally calibrated since it has matched the class base rate. Yet, while the model is calibrated in this weak sense, it is of course not a good predictive model, since it is maximally ambivalent for every feature vector. Here we can also see that making decisions based on thresholding is nonsensical since the model has a constant output: there is no variability in decision making, as the same decision would be taken for all of $\mathcal{X}$.
In practice, we usually choose a grouping function that is in between the two fine-to-coarse extremes above. In the Machine Learning literature, classifiers are usually evaluated based on confidence calibration (Guo et al., 2017), which is when the grouping function is based on the predictive model itself—specifically, by the predictor’s maximum confidence over classes.
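Definition 2.3.
Confidence Calibration: Let $g_{\max}(x) = \max_y p(y \mid x)$. A predictor $p$ is confidence calibrated if it is calibrated w.r.t. $g_{\max}$:
$$\mathbb{P}\left(y = \arg\max_{y'} p(y' \mid x) \,\middle|\, g_{\max}(x) = c\right) = c \quad \forall c \in [0, 1].$$
This choice of group function is rather natural and can be captured with the intuition: over all the cases for which the predictor is $c$ confident in the modal class, is the predictor’s accuracy $c$? Confidence calibration is the variant that is of most interest to us. This is because the standard EENN exiting strategy of $\max_y p_t(y \mid x) \geq \tau$, mentioned above, is closely related to the grouping function that defines confidence calibration. A classifier’s distance from being calibrated is quantified by expected calibration error (ECE). ECE for confidence calibration is computed as
$$\mathrm{ECE} = \mathbb{E}_{x}\left[\,\left|\,\mathbb{P}\left(y = \arg\max_{y'} p(y' \mid x) \,\middle|\, g_{\max}(x)\right) - g_{\max}(x)\,\right|\,\right],$$
where again $g_{\max}(x) = \max_y p(y \mid x)$. Yet in practice, with $g_{\max}(x)$ being a floating point number, it is unlikely that there will be groups of sufficient size to estimate the conditional accuracy $\mathbb{P}(y = \arg\max_{y'} p(y' \mid x) \mid g_{\max}(x))$. Thus we must resort to discretizing the interval $[0, 1]$ into ‘bins’ (e.g. $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1]$), which should be done adaptively to the data at hand (Kumar et al., 2019).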
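A binned ECE estimate can be sketched as follows. This is an illustrative simplification, not the estimator used in the paper: it uses equal-width bins rather than bins adapted to the data, and the function name `expected_calibration_error` and its arguments are assumptions made for the example. Each sample contributes its top-class confidence and a 0/1 flag saying whether the top class was correct.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

For instance, four predictions made at confidence 0.9 of which three are correct fall into one bin with accuracy 0.75, so the estimate is the gap between 0.75 and 0.9.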
3 The Interplay Between Calibration and Adaptive Compute in EENNs
We are now ready to study EENNs through the lens of calibration and investigate how the particular structure and assumptions underlying EENNs interact with calibration.
3.1 Trivial and Non-Trivial Early-Exit NNs
We begin with a definition of a useless EENN.
Definition 3.1.
Trivial EENN: We call an EENN ‘trivial’ if all of its constituent predictive models are identical:
$$p_1(y \mid x) = p_2(y \mid x) = \ldots = p_T(y \mid x) \quad \forall (x, y).$$
If all the constituent predictive models return the same probability vector for all inputs, then clearly there is no reason to ‘keep around’ all exits. Instead, the modeler should just use $p_1$ (assuming it has the smallest computational overhead) to make all predictions and discard the other sub-models. One may also consider ‘partially trivial’ EENNs, which would have some but not total redundancy in their sub-classifiers. For our theoretical treatment, we assume that all EENNs are either trivial or non-trivial, since any partially trivial EENN can have its redundant sub-models pruned in order to make it non-trivial.
We can now show our first result: that canonically calibrated EENNs are trivial EENNs.
Proposition 3.2.
Canonically Calibrated EENNs are Trivial: If all constituent models of an EENN are canonically calibrated, then the EENN is trivial.
Proof: By the definition of canonical calibration, $p_t(y \mid x) = \mathbb{P}(y \mid x)$ for all $t$. Thus all sub-models are encoding the same distribution, the Bayes classifier $\mathbb{P}(y \mid x)$.
While Proposition 3.2 encodes a best-case scenario that would all but never occur in practice, it shows that for an EENN to be non-trivial, we expect model misspecification. Moreover, the sub-models should be misspecified in unique ways in order to justify keeping all exits in the EENN. However, Proposition 3.2 is not necessarily true for other forms of calibration.
Proposition 3.3.
Calibration Does Not Imply a Trivial EENN: Assume an EENN is calibrated w.r.t. grouping function $g$ for all sub-models. If there exists an equivalence class of size 2 or larger with unique evaluations of $p_t$, then the EENN is not necessarily trivial.
Proof: Assume an EENN is calibrated w.r.t. $g$ and is trivial. For an equivalence class containing $x_1$ and $x_2$ such that $p_t(\cdot \mid x_1) \neq p_t(\cdot \mid x_2)$, let the predictive distributions encoded by the $t$-th subclassifier be switched, meaning $p_t'(\cdot \mid x_1) = p_t(\cdot \mid x_2)$ and $p_t'(\cdot \mid x_2) = p_t(\cdot \mid x_1)$. This new EENN is still calibrated w.r.t. $g$, since permuting $p_t$’s values within an equivalence class leaves the expectation unchanged, but now encodes a different predictive model than the other subclassifiers. Thus calibration can be preserved while making the EENN non-trivial. For example, a trivial EENN for a balanced binary prediction task could be marginally calibrated if all exits output one value for half of the instances and the opposite value for the other half. Exchanging some instances for the $t$-th exit to output the opposite value preserves marginal calibration while making the $t$-th exit encode a different model from the others.
3.2 EENNs with Meaningful Adaptive Computation
Now that we have established that an EENN can be trivial under some forms of calibration (e.g. canonical calibration), but not all, the natural question is: what are properties that we should be looking for in a calibrated EENN? To answer this question, we must first define the concept of excess conditional risk:
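Definition 3.4.
Excess Conditional Risk: Assume the log loss. Let $R_t(x) = \mathbb{E}_{y \sim \mathbb{P}(y \mid x)}\left[-\log p_t(y \mid x)\right]$ be the risk under the model at exit $t$, and let $R^*(x) = \mathbb{E}_{y \sim \mathbb{P}(y \mid x)}\left[-\log \mathbb{P}(y \mid x)\right]$ be the Bayes conditional risk. The excess risk is: $\mathcal{E}_t(x) = R_t(x) - R^*(x)$, with $\mathcal{E}_t(x)$ being non-negative and $R^*(x) = \mathbb{H}[y \mid x]$, the irreducible uncertainty in the generative process.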
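The non-negativity of the excess risk under the log loss follows because it is a Kullback-Leibler divergence. The short derivation below uses assumed notation for illustration: $\mathbb{P}(y \mid x)$ for the true conditional distribution, $p_t(y \mid x)$ for the predictive distribution of exit $t$, and $\mathcal{E}_t(x)$ for the excess conditional risk.

```latex
\begin{aligned}
\mathcal{E}_t(x)
  &= \mathbb{E}_{y \sim \mathbb{P}(y \mid x)}\left[-\log p_t(y \mid x)\right]
   - \mathbb{E}_{y \sim \mathbb{P}(y \mid x)}\left[-\log \mathbb{P}(y \mid x)\right] \\
  &= \sum_{y} \mathbb{P}(y \mid x) \log \frac{\mathbb{P}(y \mid x)}{p_t(y \mid x)} \\
  &= \mathrm{KL}\left(\mathbb{P}(\cdot \mid x) \,\|\, p_t(\cdot \mid x)\right) \;\geq\; 0 .
\end{aligned}
```

Equality holds only when $p_t(\cdot \mid x) = \mathbb{P}(\cdot \mid x)$, which is the canonically calibrated case of Proposition 3.2.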
This notion of excess risk is important since, recalling Proposition 3.2, an EENN’s exits must be different from the Bayes classifier in order for it to be non-trivial! In other words, there must be excess risk, and we are interested in how it is allocated across the exits. Now with this notion of suboptimality (w.r.t the Bayes classifier), we define the class of ideal EENNs whose excess risk reduces with each exit.
Definition 3.5.
Conditional Anytime EENN: We say an EENN has a conditional anytime property if the excess conditional risk $\mathcal{E}_t(x)$ is strictly decreasing with each exit:
$$\mathcal{E}_1(x) > \mathcal{E}_2(x) > \ldots > \mathcal{E}_T(x) \quad \forall x \in \mathcal{X}.$$
We call these ‘conditional anytime’ EENNs since they have the anytime property that using more compute results in lower risk for each test point (Zilberstein, 1996; Jazbec et al., 2023), as the EENN is approaching the Bayes classifier with depth. In turn, any subsequent exit is superior to the current one, making the exiting decision entirely a function of one’s computational budget. However, most EENNs exhibit only a marginal anytime property (Jazbec et al., 2023), meaning that excess risk decreases not for each point but just on average over $x$: $\mathbb{E}_x[\mathcal{E}_1(x)] > \mathbb{E}_x[\mathcal{E}_2(x)] > \ldots > \mathbb{E}_x[\mathcal{E}_T(x)]$.
Now assume a non-trivial EENN is calibrated with a different grouping function at each exit: $g_1, \ldots, g_T$. The excess risk at each exit can then be characterized by how informative $x$ is about $y$ within the group $g_t(x)$.
Proposition 3.6.
Excess Risk for Calibrated Exits: Assume an EENN is non-trivial and calibrated at each exit w.r.t. grouping functions $g_1, \ldots, g_T$. Then the EENN’s group-conditional excess risk at exit $t$ is:
$$\mathbb{E}\left[\mathcal{E}_t(x) \mid g_t(x)\right] = I\left(y; x \mid g_t(x)\right),$$
where $I$ denotes mutual information.
This result shows that a calibrated exit controls excess risk but only on average over $x$’s group. Mutual information decreases as excess risk decreases: $x$ provides less and less predictive information than what is already encoded in the grouping function. However, this result alone tells us nothing about an EENN having an anytime property. For that to occur, we need to have the grouping functions structured across the exits.
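Theorem 3.7.
Grouping Refinement Reduces Excess Risk Across Exits: If the grouping functions form a refinement chain $g_1 \preceq g_2 \preceq \ldots \preceq g_T$, then for every pair of exits $s, t$ such that $s < t$, we have:
$$\mathbb{E}\left[\mathcal{E}_s(x) \mid g_s(x)\right] - \mathbb{E}\left[\mathcal{E}_t(x) \mid g_s(x)\right] = I\left(y; g_t(x) \mid g_s(x)\right).$$
Non-negativity of conditional mutual information then gives a group-conditional anytime property.
Proof.
By Proposition 3.6, $\mathbb{E}[\mathcal{E}_s(x) \mid g_s(x)] = I(y; x \mid g_s(x))$, and analogously $\mathbb{E}[\mathcal{E}_t(x) \mid g_s(x)] = I(y; x \mid g_t(x), g_s(x))$, using $g_s \preceq g_t$ so that $g_t(x)$ already determines $g_s(x)$. The chain rule for conditional mutual information yields $I(y; x \mid g_s(x)) = I(y; g_t(x) \mid g_s(x)) + I(y; x \mid g_t(x), g_s(x))$, and subtracting gives the claim. ∎
The key takeaway of this result (and this paper) is that it is not enough for an EENN to be calibrated. Rather, it must have its calibration be well-structured such that later partitions refine earlier ones: $g_1 \preceq g_2 \preceq \ldots \preceq g_T$. This is the vital property that links computation and calibration.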
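The claim that refining the partition lowers excess risk by exactly an information term can be checked numerically on a toy problem. The sketch below makes several assumptions for illustration: binary labels, four equally likely inputs, a group-calibrated predictor that outputs the average true probability within its group, and hypothetical helper names (`bernoulli_kl`, `group_calibrated`, `mean_excess_risk`). The quantity `info_gain` is the average KL divergence between the fine and coarse group-level predictors, which equals the conditional mutual information between the label and the fine grouping given the coarse one.

```python
import math
from typing import Dict, List, Sequence


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def group_calibrated(true_p: Sequence[float], groups: Sequence[int]) -> List[float]:
    """Predictor that outputs its group's average true probability (inputs uniform)."""
    members: Dict[int, List[float]] = {}
    for p, g in zip(true_p, groups):
        members.setdefault(g, []).append(p)
    return [sum(members[g]) / len(members[g]) for g in groups]


def mean_excess_risk(true_p: Sequence[float], predicted: Sequence[float]) -> float:
    """Average log-loss excess risk, i.e. the mean KL from truth to prediction."""
    return sum(bernoulli_kl(p, q) for p, q in zip(true_p, predicted)) / len(true_p)


true_p = [0.1, 0.3, 0.6, 0.9]  # P(y = 1 | x) for four equally likely inputs
g_coarse = [0, 0, 0, 0]        # a single group
g_fine = [0, 0, 1, 1]          # refines the coarse partition

pred_coarse = group_calibrated(true_p, g_coarse)
pred_fine = group_calibrated(true_p, g_fine)

risk_gap = mean_excess_risk(true_p, pred_coarse) - mean_excess_risk(true_p, pred_fine)
info_gain = mean_excess_risk(pred_fine, pred_coarse)
```

In this toy setting the two quantities agree up to floating-point error, and the gap is positive because the fine grouping separates inputs with different label probabilities.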
Confidence Calibration
Next we consider confidence calibration, a focus of both the machine learning and EENN literatures. We show that it fails to control the excess risk:
Proposition 3.8.
Confidence Calibration Does Not Control Log-Loss-Based Risk: Assume $K \geq 3$. For any confidence level $c$, there exist confidence-calibrated predictors with arbitrarily large $\mathcal{E}(x)$ under log-loss.
Proof: Consider any $x$ for which some non-modal class $k$ has positive probability under the true generative process, $\mathbb{P}(y = k \mid x) > 0$. Modify $p$ on this input by setting $p(k \mid x) = \epsilon$ and redistributing the remaining confidence over the other non-modal classes. The modal-class probability is unchanged, so the modified predictor is still confidence calibrated, but the pointwise log-loss now contains the term $-\mathbb{P}(y = k \mid x) \log \epsilon$, which diverges as $\epsilon \to 0$. Hence $R(x) \to \infty$ and $\mathcal{E}(x) \to \infty$, while confidence calibration is preserved for every $\epsilon > 0$. In Proposition 3.6, calibration w.r.t. $g_t$ pins down the entire predictive distribution on each equivalence class, so excess risk is controlled (by mutual information). Confidence calibration, in contrast, controls only one of the output coordinates. Since log-loss depends on whichever coordinate matches the true label, the excess risk is not bounded at all. The EENN can therefore fail to satisfy a group-conditional anytime property even though every exit is perfectly confidence calibrated.
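The construction in the proof can be illustrated numerically. The sketch below uses made-up numbers for a three-class problem: the true conditional distribution `true_dist` and the helper `perturbed_predictor` are assumptions for the example. The predictor keeps its top-class confidence fixed while shrinking the confidence on one non-modal class that has positive true probability, and the expected log-loss at that input grows without bound. The example only shows that the modal confidence is untouched; confidence calibration itself is a property of the whole population.

```python
import math
from typing import List, Sequence


def expected_log_loss(true_dist: Sequence[float], predicted: Sequence[float]) -> float:
    """Expected log-loss at one input when labels follow true_dist."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, predicted) if p > 0.0)


true_dist = [0.6, 0.3, 0.1]  # hypothetical true conditional distribution


def perturbed_predictor(eps: float) -> List[float]:
    """Keep the modal confidence at 0.6 and push class 2 down to eps."""
    return [0.6, 0.4 - eps, eps]


epsilons = [1e-1, 1e-3, 1e-6, 1e-9]
losses = [expected_log_loss(true_dist, perturbed_predictor(e)) for e in epsilons]
top_confidences = [max(perturbed_predictor(e)) for e in epsilons]
```

The top confidence stays at 0.6 for every value of `eps`, while `losses` increases as `eps` shrinks.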
However, the failure mode of Proposition 3.8 is specific to log-loss, and if we weaken the excess risk such that it only considers the modal class, then confidence calibration can control the excess risk. Below we consider $\mathcal{E}^{0\text{-}1}(x)$, the excess risk under the 0-1 loss.
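Lemma 3.9.
Confidence Calibration Controls 0-1 Excess Risk: Assuming $p$ is confidence calibrated at level $c$, the group-conditional 0-1 excess risk satisfies
$$\mathbb{E}\left[\mathcal{E}^{0\text{-}1}(x) \mid g_{\max}(x) = c\right] = \mathbb{E}\left[\max_y \mathbb{P}(y \mid x) \,\middle|\, g_{\max}(x) = c\right] - c.$$
Proof: The 0-1 risk at $x$ depends on $p$ only through $\hat{y}(x) = \arg\max_y p(y \mid x)$, namely $R^{0\text{-}1}(x) = 1 - \mathbb{P}(y = \hat{y}(x) \mid x)$. Averaging over the equivalence class of $g_{\max}(x) = c$ and applying confidence calibration, $\mathbb{P}(y = \hat{y}(x) \mid g_{\max}(x) = c) = c$, gives $1 - c$. The Bayes 0-1 risk averaged over the class is $1 - \mathbb{E}[\max_y \mathbb{P}(y \mid x) \mid g_{\max}(x) = c]$, and subtracting yields the stated excess.
Even though confidence calibration can control excess risk under an appropriate (matching) choice of loss function, we still need the additional, structural ordering of the grouping function in order for the EENN to support anytime computation.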
4 Early Exit Failure Prediction (EEFP)
As argued in the previous section, classifier calibration that is localized per-exit is ill-suited to EENNs, as its underlying assumptions are fundamentally misaligned with EENN design. Calibration, without inter-exit structural assumptions, ignores the computational cost induced by subsequent exit decisions. Therefore, we next consider an alternative prediction problem that explicitly accounts for the utility of subsequent classifiers in the exit decision. We show that when this alternative meta-model is calibrated, it contains the necessary awareness of the inter-exit quality.
4.1 Failure Prediction for EENNs
We begin by formulating the ideal exiting criterion for EENNs. Given a true label $y$ and prediction $\hat{y}_t$ of the $t$-th classifier in the EENN, the optimal stopping decision for exit $t$, denoted $s_t$, is:
$$s_t = \begin{cases} 1 & \text{if } \hat{y}_t = y \ \text{ or } \ \hat{y}_{t'} \neq y \ \ \forall\, t' > t, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
This means that the EENN should exit at the $t$-th classifier ($s_t = 1$) if its prediction is correct or if none of the subsequent classifiers would be correct. Consequently, $s_t = 1$ indicates that exiting at depth $t$ is predictively and computationally optimal: processing the sample with deeper classifiers cannot produce a better prediction with less computation. We refer to this task as failure prediction (Corbière et al., 2019; Hendrycks and Gimpel, 2016) to emphasize its awareness of computation.
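The stopping rule described above (exit if the current prediction is correct, or if no later exit would be correct) translates directly into code. The sketch below is illustrative: the function name `optimal_stopping_targets` and the list-based inputs are assumptions, with one predicted class index per exit for a single sample.

```python
from typing import List, Sequence


def optimal_stopping_targets(preds_per_exit: Sequence[int], label: int) -> List[int]:
    """Return one 0/1 stopping target per exit for a single sample.

    The target is 1 if the exit is correct, or if every later exit is wrong.
    """
    targets = []
    for t, pred in enumerate(preds_per_exit):
        correct_now = pred == label
        correct_later = any(p == label for p in preds_per_exit[t + 1:])
        targets.append(1 if correct_now or not correct_later else 0)
    return targets
```

For example, with predictions `[2, 1, 1]` and label `1`, the first exit should continue because a later exit is correct, giving targets `[0, 1, 1]`. If no exit is ever correct, every target is 1, since further computation cannot help. The final exit always receives target 1.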
Failure prediction is challenging as it must be done without access to the true label and without obtaining the predictions of future exits. Yet this prediction task should be tractable in many cases, as it does not require knowing exactly what label the future exits will predict. Rather, it only requires inferring the binary variable of whether some unspecified future exit(s) will be correct. A similar assumption is employed by the learning to defer literature (Mozannar and Sontag, 2020; Verma et al., 2023) for modeling human prediction quality without having to fully encode the human’s behavior within a model.
Calibration
Given a fixed EENN and prediction task, the underlying generative model for failure prediction is:
$$s_t \mid x \sim \mathrm{Bernoulli}\left(\pi_t(x)\right), \qquad \pi_t(x) = q_t(x) + \left(1 - q_t(x)\right) \prod_{t' > t} \left(1 - q_{t'}(x)\right),$$
and $q_t(x) = \sum_{y} p_t(y \mid x)\, \mathbb{P}(y \mid x)$. Thus $q_t(x)$ denotes the probability of the predictions sampled from $p_t(y \mid x)$ colliding with the true label sampled from $\mathbb{P}(y \mid x)$. Now assume the EENN is calibrated w.r.t. grouping functions $g_1, \ldots, g_T$. The next result links the anytime property to a monotonically-increasing probability of stopping:
Proposition 4.1.
Refinement Implies Monotone Stopping: Assume the setting of group-conditional anytime computation: each exit is calibrated w.r.t. $g_t$ and the partitions are ordered by refinement $g_1 \preceq g_2 \preceq \ldots \preceq g_T$. Then the EEFP-style stopping probability $\pi_t(x) = \mathbb{P}(s_t = 1 \mid x)$ is non-decreasing in $t$, in expectation over each equivalence class of the coarser exit:
$$\mathbb{E}\left[\pi_s(x) \mid g_s(x)\right] \leq \mathbb{E}\left[\pi_t(x) \mid g_s(x)\right] \quad \text{for all } s < t.$$
The proof is provided in Appendix A. While failure prediction is sensitive to the partition-granularity of the current exit vs future ones, its view of future exits is in aggregate. In other words, the probability of stopping is the same regardless of whether the correct prediction is expected to arrive at the next exit or the very last one. To achieve this level of sensitivity, one must modify the product so that it is not permutation invariant. We leave this as a topic for future work.
4.2 Meta-Classifier for Failure Prediction
We propose a post-training method for EENNs that directly predicts the optimal stopping decision $s_t$. Instead of relying on standard calibration methods such as temperature calibration, which ignore downstream computation, we train lightweight meta-models to estimate the probability of stopping at a given depth. Based on the predictions of the current classifier, these modules model the optimal stopping decision by outputting a scalar stopping probability. This approach therefore provides an alternative confidence score that explicitly accounts for inter-exit dynamics.
Our first implementation operates using the exit classifier’s confidences as inputs:
$$\hat{s}_t^{(i)} = \phi_t\left(\left[\,p_t(y \mid x^{(i)})\,\right]_{y \in \mathcal{T}_k^{t}(x^{(i)})}\right),$$
where $\hat{s}_t^{(i)}$ is the stopping decision of the $t$-th exit for the $i$-th sample, and $\mathcal{T}_k^{t}(x^{(i)})$ denotes the top-$k$-ranked (e.g. $k = 5$) class indices according to a descending ordering of $p_t$’s confidences. In our experiments, $\phi_t$ is implemented as a multi-layer perceptron (MLP).
Our second implementation leverages the predictions made at previous exits as additional input features:
$$\hat{s}_t^{(i)} = \phi_{t,h}\left(\left[\,p_{t'}(y \mid x^{(i)})\,\right]_{y \in \mathcal{T}_k^{t}(x^{(i)}),\; t' \leq t}\right), \tag{4}$$
where $\mathcal{T}_k^{t}(x^{(i)})$ denotes the top-$k$-ranked class indices as computed for the $t$-th exit’s classifier confidences. We give this model the subscript $h$ for ‘history’, as it has the ability to see how the top-$k$ classes’ confidences evolved over the exits that have been evaluated so far. Again $\phi_{t,h}$ is implemented as an MLP whose architecture is the same except for the change in feature dimensionality.
To train both implementations, we first train the EENN of interest and freeze its weights. Then, for each exit $t$, we train the corresponding failure prediction model $\phi_t$ using the binary cross-entropy loss on a held-out set. During training, predictions from all classifiers are collected to compute true stopping targets, as defined in Eq. (2). Once trained, $\phi_t$’s output confidences can be used for threshold-based exiting, just as the classifier confidences are typically used: if $\hat{s}_t \geq \tau$, then exit.
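The input features for the two meta-model variants can be sketched as follows. This covers feature construction only, not the MLP or its training, and it is not the authors' code: the function names, the tie-breaking by lower class index, and the exit-major ordering of the history features are all assumptions made for illustration.

```python
from typing import List, Sequence


def top_k_indices(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest confidences, in descending order of confidence."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return order[:k]


def current_exit_features(probs_t: Sequence[float], k: int) -> List[float]:
    """First variant: the top-k confidences of the current exit."""
    return [probs_t[j] for j in top_k_indices(probs_t, k)]


def history_features(probs_up_to_t: Sequence[Sequence[float]], k: int) -> List[float]:
    """History variant: confidences of all exits so far on the current exit's top-k classes."""
    indices = top_k_indices(probs_up_to_t[-1], k)
    return [probs[j] for probs in probs_up_to_t for j in indices]


# Toy confidences for two exits over four classes.
exit_1 = [0.40, 0.30, 0.20, 0.10]
exit_2 = [0.10, 0.60, 0.05, 0.25]
```

With `k = 2`, the current exit's top classes in this toy example are 1 and 3, so the history variant reports how confident the first exit was on those same two classes, followed by the second exit's own values.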
4.3 Metric for Failure Prediction
To measure the ability to predict failure, we propose an EEFP score for the $t$-th internal classifier using the observed value of $s_t$, as defined in Equation 2, and confidence scores $c_t^{(i)}$ (i.e. the $t$-th exit’s score for the $i$-th sample):
$$\mathrm{EEFP}_t = \mathrm{AUROC}\left(\left\{ s_t^{(i)} \right\}_{i=1}^{N}, \left\{ c_t^{(i)} \right\}_{i=1}^{N}\right),$$
where $s_t^{(i)}$ is computed from $y^{(i)}$, the true label for the $i$-th sample, and $\hat{y}_t^{(i)}$, the prediction of the $t$-th exit for the $i$-th sample. The confidence scores are computed in two ways, to compare EEFP with traditional classifier-confidence-based exiting. For the former, $c_t^{(i)} = \hat{s}_t^{(i)}$, the output of the failure prediction model. For the latter, $c_t^{(i)} = \max_y p_t(y \mid x^{(i)})$. AUROC measures the probability that a randomly chosen correct prediction receives a higher confidence score than a randomly chosen incorrect one, reflecting how well the confidence scores separate the true stopping decisions.
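Since the EEFP score is an AUROC between the binary stopping targets and a confidence score, it can be computed with a direct pairwise count on small data. The sketch below is illustrative and assumes hypothetical names (`auroc`, `eefp_score`); ties are counted as half a win, which is the usual convention, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auroc(targets: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a positive receives a higher score than a negative (ties count 0.5)."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative target")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def eefp_score(stopping_targets: Sequence[int], confidences: Sequence[float]) -> float:
    """EEFP for one exit: AUROC of confidence scores against optimal stopping targets."""
    return auroc(stopping_targets, confidences)
```

The same function applies to both kinds of confidence score: passing the classifier's top confidence evaluates confidence-based exiting, and passing the meta-model output evaluates the corrected exit.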
5 Experiments
We conduct experiments in standard computer vision settings and evaluate early-exit ResNet-34 (He et al., 2016), ViT-Tiny, ViT-Small (Dosovitskiy et al., 2021), EfficientNet (Tan et al., 2019), and MSDNet (Huang et al., 2018) on CIFAR100 (Krizhevsky and Hinton, 2009), TinyImageNet (Yang, 2015), and ImageNet-1k (Deng et al., 2009) datasets. We train baseline early-exit models from scratch, with the exception of the ImageNet models, for which we start from a pretrained checkpoint. We then freeze the weights of the resulting models and use them as the starting point for two post-training procedures: temperature calibration (described in detail in Appendix B and Appendix C) and an EEFP-inspired confidence correction method (described in detail in Appendix C). Unless stated otherwise, we use the “history” variant of confidence correction defined in Equation 4, with $k = 5$ in the top-$k$ operator as our primary method in all the analyzed settings.
We evaluate the methods by considering the compute–accuracy trade-off of the whole EENN, as well as the expected calibration error (ECE) and our proposed EEFP score obtained for the individual classifiers. To obtain the compute–accuracy characteristics, we dynamically evaluate the networks with varying exit thresholds and plot the average compute per sample (measured in FLOPs) and the classification accuracy corresponding to the thresholds. This enables us to analyze how each metric correlates with the end-to-end performance of EENNs and to assess their suitability as indicators of overall early-exit behavior.
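One point on a compute-accuracy curve can be computed from cached per-exit outputs as sketched below. This is an illustration under assumptions, not the evaluation code of the paper: the function name `cost_accuracy_point`, the cached per-sample lists of confidences and correctness flags, and the per-exit cumulative costs are hypothetical inputs, and samples that never reach the threshold are assumed to use the last exit.

```python
from typing import Sequence, Tuple


def cost_accuracy_point(
    confidences: Sequence[Sequence[float]],
    correct: Sequence[Sequence[int]],
    costs: Sequence[float],
    tau: float,
) -> Tuple[float, float]:
    """Average cost and accuracy when each sample exits at the first exit with score >= tau."""
    n_exits = len(costs)
    total_cost = 0.0
    n_correct = 0
    for conf_row, ok_row in zip(confidences, correct):
        # First exit whose score reaches the threshold; fall back to the last exit.
        exit_index = next((t for t in range(n_exits) if conf_row[t] >= tau), n_exits - 1)
        total_cost += costs[exit_index]
        n_correct += ok_row[exit_index]
    n_samples = len(confidences)
    return total_cost / n_samples, n_correct / n_samples


# Toy data: two samples, two exits, with cumulative costs 1.0 and 2.0.
toy_conf = [[0.90, 0.95], [0.40, 0.80]]
toy_ok = [[1, 1], [0, 1]]
toy_costs = [1.0, 2.0]
```

Sweeping `tau` over a grid and collecting the returned pairs traces the curve; on the toy data, raising the threshold from 0.3 to 0.5 increases the average cost from 1.0 to 1.5 and the accuracy from 0.5 to 1.0.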
5.1 Standard Benchmarks
To evaluate the effectiveness of our confidence correction and the usefulness of EEFP for comparing EENNs, we consider three variants of EENNs: baseline networks, temperature-calibrated ones, and EENNs enhanced with our confidence correction. Specifically, we train EENNs on top of ResNet-34 and ViT-Tiny architectures on TinyImageNet, and show the obtained results in Figures 3 and 3.
The results are consistent across both architectures, and the networks with our proposed correction obtain clearly better cost-accuracy characteristics, although calibrated networks also achieve slight improvements over the baseline. Temperature calibration reduces ECE and slightly improves the EEFP score. In contrast, our confidence correction increases ECE while improving EEFP. Importantly, the relative quality of the models is accurately reflected by the EEFP scores of the internal classifiers, whereas ECE does not reliably predict end-to-end early-exit performance, supporting the validity of our proposed metric and approach.
We present additional results for MSDNet on CIFAR100, EfficientNet-B2 on ImageNet-1k and ViT-Small on ImageNet-1k in Table 1 (standard deviations are reported in Table 7), reporting averaged accuracies at selected compute budgets alongside the EEFP score and ECE averaged across internal classifiers. Consistent with earlier experiments, our confidence correction method achieves superior compute–accuracy trade-offs, and EEFP again provides a reliable prediction of model performance.
5.2 Impact of the $k$ Selection
Our confidence correction procedure (Section 4.2) applies a top-$k$ operator over class predictions to reduce computational overhead. Therefore, the choice of $k$ naturally affects the performance of our method, with smaller $k$ reducing computational overhead but discarding information potentially relevant to learning a reliable correction function. To assess the impact of $k$ and identify its most effective value, we perform an ablation study on ResNet-34 trained on TinyImageNet and report the accuracy at selected FLOPs budgets and the average EEFP score across all classifiers in Table 2 (standard deviations are reported in Table 9).
Our approach achieves robust performance across different $k$ values. Interestingly, an EENN with a smaller $k$ even slightly outperforms a variant that uses all predictions, likely because the reduced complexity of the correction module has a regularizing effect and slightly lowers the FLOPs per threshold. A closer look at the inputs to the correction module further explains this effect: the module operates on probability distributions produced by softmax, which sum to 1, so only a few entries are meaningfully larger than 0. Consequently, increasing $k$ beyond the first few entries feeds the correction module with uninformative values from the distribution tail, which may even introduce noise. Consistent with the previous experiments, higher EEFP scores in this ablation continue to correspond to better cost-accuracy trade-offs. Overall, these results support the use of the top-$k$ operator in our correction approach.
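The tail argument can be checked on a toy example. The sketch below applies softmax to a hypothetical 200-class logit vector with three dominant entries and keeps the $k$ largest probabilities; the logit values, $k = 5$, and the helper names are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: Sequence[float], k: int) -> List[float]:
    """Return the k largest probabilities in descending order."""
    return sorted(probs, reverse=True)[:k]


# Hypothetical logits: three plausible classes and 197 unlikely ones.
logits = [10.0, 8.0, 7.0] + [0.0] * 197
probs = softmax(logits)
kept = top_k(probs, 5)

# Probability mass carried by the 195 entries that top-5 discards.
tail_mass = 1.0 - sum(kept)
```

In this toy case the discarded entries together carry under 1% of the probability mass, so a corrector that sees only `kept` loses very little information.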
5.3 Confidence Correction with Previous Predictions
We propose two variants of our confidence correction procedure, defined in Equation 3 and Equation 4, respectively. The key difference between the two approaches is that the first one uses only the predictions of the corresponding classifier as input, while the second additionally leverages information specific to early-exit networks: the history of all previous predictions up to the current classifier. To evaluate the impact of incorporating this historical information, we perform an ablation study of these two approaches, and compare the results in Table 3.
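To make the difference between the two variants concrete, the sketch below builds the corrector input for a given exit in both cases. It is an assumption-laden illustration: we take the input to be the top-$k$ probabilities of each involved classifier, concatenated in exit order, and the function names are hypothetical. The exact definitions are those of Equation 3 and Equation 4.

```python
from typing import List, Sequence


def top_k(probs: Sequence[float], k: int) -> List[float]:
    """Return the k largest probabilities in descending order."""
    return sorted(probs, reverse=True)[:k]


def historyless_input(
    probs_per_exit: Sequence[Sequence[float]], exit_idx: int, k: int
) -> List[float]:
    """Features from the current classifier only (assumed form)."""
    return top_k(probs_per_exit[exit_idx], k)


def history_input(
    probs_per_exit: Sequence[Sequence[float]], exit_idx: int, k: int
) -> List[float]:
    """Features from all classifiers up to and including the current one."""
    features: List[float] = []
    for j in range(exit_idx + 1):
        features.extend(top_k(probs_per_exit[j], k))
    return features


# Toy predictions of a 3-exit network on a 4-class problem.
probs_per_exit = [
    [0.40, 0.30, 0.20, 0.10],
    [0.60, 0.20, 0.15, 0.05],
    [0.85, 0.10, 0.03, 0.02],
]
x_plain = historyless_input(probs_per_exit, exit_idx=2, k=2)
x_hist = history_input(probs_per_exit, exit_idx=2, k=2)
```

Under this assumed layout the history input lets the corrector see how the leading probabilities evolve across exits (here 0.40, 0.60, 0.85), which the historyless input cannot express.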
Confidence correction with history improves accuracy at middle-to-high compute budgets, where information from previous classifiers can be effectively leveraged. At lower compute budgets, the simpler, historyless correction performs slightly better, likely because early classifiers are weak and their predictions are less informative. Nevertheless, the historyless variant still outperforms standard calibration. The compute-accuracy behavior of the compared models is consistently well captured by the EEFP score, as observed in the previous experiments.
5.4 Failure Cases of Calibrated EENNs
Finally, to further demonstrate the usefulness of the EEFP score, we highlight empirical cases where deliberately miscalibrated EENNs outperform the calibrated ones, and where standard calibration metrics such as ECE fail to correctly rank model performance in contrast to our proposed EEFP score. We calibrate the intermediate classifiers of MSDNet models trained on CIFAR100 using temperature scaling (Guo et al., 2017), and subsequently create two deliberately decalibrated variants by multiplying the temperature in the intermediate classifiers by 3.0 or 0.3, resulting in underconfident and overconfident models, respectively. We then evaluate cost–accuracy curves, the expected calibration error (ECE) of the intermediate classifiers, and their EEFP scores and show the results in Figure 4. Overconfident models achieve a more favorable cost–accuracy trade-off despite substantially worse calibration scores; their performance, however, is accurately captured by the EEFP metric.
5.5 Computational Cost of Confidence Correction
One can easily estimate the cost of a single history-variant confidence corrector. In our experiments, each corrector is a 2-layer MLP, so its cost in MAC operations is approximately the input size multiplied by the hidden dimension, plus the hidden dimension for the output layer. The input of the $i$-th corrector is built from the top-$k$ predictions of the classifiers up to exit $i$, so its size grows with $i$. In Table 4 we report the FLOPs as measured for the early-exit ResNet-34 models executed on 64x64 inputs. The computational cost of our confidence correctors is negligible.
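The estimate can be reproduced with a few lines of arithmetic. In the sketch below the values of $k$ and the hidden dimension are placeholders chosen for illustration, the input size is assumed to be $i \cdot k$ for the $i$-th exit (counting from 1), and the function names are hypothetical.

```python
def mlp_macs(input_dim: int, hidden_dim: int, output_dim: int = 1) -> int:
    """Multiply-accumulate count of a 2-layer MLP, ignoring biases and activations."""
    return input_dim * hidden_dim + hidden_dim * output_dim


def corrector_macs(exit_index: int, k: int, hidden_dim: int) -> int:
    """MACs of a history-variant corrector at a 1-indexed exit.

    Assumes the input concatenates k values from each of the first
    `exit_index` classifiers.
    """
    return mlp_macs(exit_index * k, hidden_dim)


# Placeholder settings, not the values used in the experiments.
macs_second_exit = corrector_macs(exit_index=2, k=5, hidden_dim=64)
```

Even with generous placeholder values the count stays in the hundreds or low thousands of MACs per exit, which is orders of magnitude below the cost of a convolutional backbone.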
6 Related Work
Early-Exit Neural Networks
Several works on EENNs have examined the calibration of intermediate classifiers as a way to improve performance (Pacheco et al., 2023; Mofakhami et al., 2024; Wójcik et al., 2023). Yet calibration failures have motivated alternative approaches for uncertainty quantification and consistency of the exiting policy, such as training a gating mechanism (Regol et al., 2024), post-hoc re-calibration for monotonicity (Jazbec et al., 2023), anytime-valid hypothesis testing (Jazbec et al., 2024a), distribution-free risk control (Schuster et al., 2022; Jazbec et al., 2024b), and Bayesian treatments of the classifiers (Meronen et al., 2024).
Model Cascades
Our analysis of calibration and EENNs equally applies to model cascades (Viola and Jones, 2001; Saberian and Vasconcelos, 2014; Marquez et al., 2018). Cascades, too, can be formulated as a sequence of models for which we expect later models to outperform earlier ones (to justify the additional computation). Thus, confidence calibration alone is not sufficient for a well-adaptive cascade, as has been pointed out by Jitkrittum et al. (2023) and Regol et al. (2025). However, our proposed remedy of failure prediction does not translate as well to cascades, since model sizes in a cascade often jump by orders of magnitude as the sequence progresses. To predict whether the subsequent models will fail, the failure prediction model itself would need to be quite powerful.
7 Conclusion, Limitations, and Future Work
In this work, we challenge the assumption that confidence calibration of an EENN’s intermediate classifiers improves performance. We highlight that calibration does not innately account for computational cost. To address these limitations, we propose an approach based on failure prediction (EEFP). Critically, EEFP considers both prediction correctness and the cost of continuing inference—factors ignored by traditional formulations of calibration. We propose a lightweight meta-classifier that models the stopping probability under failure prediction and show that it consistently improves the efficiency of EENNs across diverse benchmarks.
The central limitation of our work is that we do not propose a method for better calibrating an EENN by imposing the structural ordering of the per-exit grouping functions required for anytime computation (Theorem 3.7). Furthermore, the grouping functions used in our theoretical analysis remain abstract, and measuring their granularity in practice would be challenging and nearly impossible for high-dimensional problems. Yet, we hope this work will improve the field’s understanding of EENNs and inspire future research that guarantees an EENN will have the inter-exit structure that enables adaptive computation and early recognition of hopelessly difficult instances. Moreover, we believe our insights can also be applied to adaptive computation with large language models, such as in chain-of-thought reasoning (Wei et al., 2022; Wang et al., 2026).
Acknowledgements
We thank Metod Jazbec for helpful discussions. Filip Szatkowski was funded by National Science Centre (NCN, Poland) Grant No. 2022/45/B/ST6/02817.
Impact Statement
The goal of our paper is to advance the theoretical understanding and efficiency of early-exit neural networks. We aim for our research to contribute to the development of adaptive computation models that reduce overall computational cost, enabling broader deployment of neural networks in resource-constrained settings. Our methods are broadly applicable to neural network models and, more generally, to dynamic computation machine learning systems. While more efficient neural networks may have diverse societal consequences, we do not identify any specific impact that warrants highlighting here; we consider potential risks and impacts to be specific to particular applications of neural networks within subfields of computer science.
Appendix A Proof of Proposition 4.1
Proof.
Decompose
[TABLE]
with being the collision probability and being the “all remaining exits fail” product.
Step 1 (the term, pointwise). Since each factor lies in , for every , so .
Step 2 (the term, via refinement and Jensen). By calibration of , , so the tower property gives . The same calculation at exit , conditioned down to using , yields where the inequality is Jensen on the convex map .
Adding Steps 1 and 2 proves the claim. ∎
Appendix B Influence of Temperature Scaling
B.1 Probability distribution after temperature scaling
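In a classification problem with $K$ classes, suppose the model outputs logits $z_1, \dots, z_K$ for a given data sample, and let $\hat{y} = \arg\max_i z_i$ denote the index of the most probable class. Without temperature scaling, the softmax probabilities are
$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.$$
By introducing
$$Z = \sum_{j=1}^{K} e^{z_j},$$
we can equivalently write
$$z_i = \log p_i + \log Z.$$
When scaling logits by a temperature parameter $T > 0$, the probabilities become
$$p_i^{(T)} = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}} = \frac{p_i^{1/T}}{\sum_{j=1}^{K} p_j^{1/T}}.$$
Therefore, the confidence changes from $p_{\hat{y}}$ to
$$p_{\hat{y}}^{(T)} = \frac{p_{\hat{y}}^{1/T}}{\sum_{j=1}^{K} p_j^{1/T}}.$$
Since the denominator depends on the entire probability distribution (and not only on $p_{\hat{y}}$), two samples with the same original confidence can yield different scaled confidences after temperature scaling.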
B.2 Temperature scaling does not preserve the ranking of samples
Consider a toy example with four classes. The classifier’s logit outputs are shown in Table 5, while the corresponding softmax probabilities are reported in Table 6. Confidences of the predictions are as follows: , , . Therefore the ranking of samples is (from the most to the least confident ones). However, when temperature changes, the ranking does as well. For example for temperature 0.3, , , , and the ranking changes to . On the other hand, for temperature 3.0, , , , and the ranking changes to .
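The same effect can be reproduced numerically. The sketch below uses two hypothetical four-class probability vectors (not the values of Tables 5 and 6) and relies on the fact that applying temperature $T$ to the logits is equivalent to raising each softmax probability to the power $1/T$ and renormalizing. The function name and the chosen probabilities are illustrative assumptions.

```python
from typing import Sequence


def scaled_confidence(probs: Sequence[float], temperature: float) -> float:
    """Top-1 confidence after temperature scaling, computed from probabilities.

    Uses softmax(z / T)_i = p_i^(1/T) / sum_j p_j^(1/T).
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    return max(powered) / sum(powered)


# Sample A: one leading class, remaining mass spread evenly.
sample_a = [0.5, 0.5 / 3, 0.5 / 3, 0.5 / 3]
# Sample B: slightly higher top-1 probability, but a strong runner-up.
sample_b = [0.52, 0.46, 0.01, 0.01]

conf_a_t1 = scaled_confidence(sample_a, 1.0)
conf_b_t1 = scaled_confidence(sample_b, 1.0)
conf_a_sharp = scaled_confidence(sample_a, 0.3)
conf_b_sharp = scaled_confidence(sample_b, 0.3)
```

At $T = 1$ sample B is the more confident one, whereas at $T = 0.3$ sample A overtakes it because its competing classes are suppressed much more strongly. A confidence-thresholding exit rule would therefore treat the two samples in a different order after rescaling.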
Appendix C Experimental Setup Details
This section describes the experimental setup used throughout the experiments reported in the main paper.
C.1 Training
All models are trained using the AdamW optimizer (Loshchilov and Hutter, 2019). We employ a cosine annealing learning rate scheduler with warm restarts and a linear warm-up phase. Data augmentation includes random resizing, cropping, rotation, contrast adjustment, random erasing, Mixup (Zhang et al., 2018), and CutMix (Yun et al., 2019). Training is performed until convergence using an early-stopping criterion, with a maximum limit of 1000 epochs.
C.2 Calibration
The calibration of the early-exit network is performed using a gradient-based approach on a held-out validation set that is not used to optimize the EENN. During calibration, the EENN parameters are frozen, and temperature scaling modules are attached to each exit head. The calibration objective minimizes the negative log-likelihood (NLL). Each exit head is calibrated independently with its own temperature parameter. For each experimental configuration, three calibration models are trained, each with different initial random weights. No data augmentations are applied during calibration in order to match the test-time data distribution as closely as possible.
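A minimal sketch of such a gradient-based fit for a single exit head is given below. It assumes plain gradient descent on the log-temperature (to keep the temperature positive) with a hand-derived NLL gradient; the optimizer, learning rate, step count, and function names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import math
from typing import Sequence, Tuple


def nll_and_grad(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    log_t: float,
) -> Tuple[float, float]:
    """Mean NLL of softmax(z / T) and its derivative w.r.t. log T."""
    t = math.exp(log_t)
    nll, grad = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / t for z in logits]
        shift = max(scaled)
        exps = [math.exp(s - shift) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        nll -= math.log(probs[y])
        expected_logit = sum(p * z for p, z in zip(probs, logits))
        # d NLL / d log T = (z_y - E_p[z]) / T
        grad += (logits[y] - expected_logit) / t
    n = len(labels)
    return nll / n, grad / n


def fit_temperature(logits_batch, labels, lr: float = 0.1, steps: int = 1000) -> float:
    """Fit one temperature on frozen logits by gradient descent on log T."""
    log_t = 0.0  # start from T = 1
    for _ in range(steps):
        _, grad = nll_and_grad(logits_batch, labels, log_t)
        log_t -= lr * grad
    return math.exp(log_t)


# Toy overconfident head: 75% accuracy with about 98% confidence at T = 1.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]
temperature = fit_temperature(val_logits, val_labels)
```

On this toy set the NLL is minimized when the confidence equals the accuracy of 0.75, which corresponds to $T = 4 / \ln 3 \approx 3.64$. In the multi-exit setting the same routine would be run once per exit head on that head's frozen logits.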
C.3 Confidence Correction
EEFP Confidence Correctors (C.C.) consist of two-layer MLPs with a hidden dimension of and are trained on the same held-out dataset used for calibration. The EENN parameters remain frozen, and each C.C. head is optimized independently by minimizing the binary cross-entropy loss between the target confidence defined in Equation (2) and the predicted value. Each C.C. head has its own set of learnable parameters. As in calibration, three C.C. models are trained per experiment, each with different initial random weights. Confidence corrector training is performed without data augmentation to ensure consistency with the test-time data distribution.
C.4 Evaluation in the Early-Exit Environment
For each model and each exit head, the decision thresholds are determined using a validation set that is disjoint from the test set. Threshold selection follows the heuristic proposed by Huang et al. (2018), which derives exit-specific thresholds based on the empirical distribution of confidence scores.
Given a predefined FLOPs budget, the network is expected to terminate at each exit for a prescribed fraction of input samples. The allocation of samples across exits is controlled by a parameter . Specifically, at the -th exit, the required fraction of samples that terminate is defined as:
[TABLE]
During inference on the test set, each model computes confidence scores for incoming samples and applies its corresponding set of thresholds to determine the exit point and, consequently, the final prediction.
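The threshold selection step can be sketched as follows. The code takes the per-exit target fractions as given input, so it does not depend on the particular allocation formula, and picks each threshold as the confidence of the last sample that should leave at that exit among those still in the network. The function name, the rounding rule, and the tie handling are assumptions made for illustration.

```python
from typing import List, Sequence


def select_thresholds(
    val_conf: Sequence[Sequence[float]],  # val_conf[sample][exit]
    fractions: Sequence[float],           # target share of samples per exit
) -> List[float]:
    """Derive one confidence threshold per exit from validation confidences."""
    n = len(val_conf)
    remaining = list(range(n))
    thresholds: List[float] = []
    for i in range(len(fractions) - 1):
        # Most confident remaining samples are the ones that exit here.
        remaining.sort(key=lambda s: val_conf[s][i], reverse=True)
        n_out = min(len(remaining), round(n * fractions[i]))
        if n_out == 0:
            thresholds.append(float("inf"))  # nobody exits at this head
            continue
        threshold = val_conf[remaining[n_out - 1]][i]
        thresholds.append(threshold)
        # Samples tied with the threshold also exit, as they would at test time.
        remaining = [s for s in remaining if val_conf[s][i] < threshold]
    thresholds.append(float("-inf"))  # the last exit accepts every sample
    return thresholds


# Toy validation set: 4 samples, 2 exits, half of the samples per exit.
val_conf = [[0.9, 0.9], [0.8, 0.9], [0.3, 0.9], [0.2, 0.9]]
thresholds = select_thresholds(val_conf, [0.5, 0.5])
```

At test time a sample leaves at the first exit whose confidence is at least the corresponding threshold, so changing the fractions moves the network along its compute–accuracy curve.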
Appendix D Full Results
Due to space constraints, we presented only the averaged results in the main article. In Table 7 and Table 8 we present the full results including standard deviations for Table 1 and Table 3, respectively.
Appendix E Extended Comparison to Calibration Methods
In the main paper, we compared our method only to temperature scaling, which is one of the most popular post-hoc calibration methods. In Table 9 we provide extended evaluation results that also include vector scaling and matrix scaling (Guo et al., 2017). Additionally, we test confidence correctors with confidence target (CCCT): we use an MLP with the same architecture as for our method, but instead of using targets as defined in Equation 2, we use:
[TABLE]
which essentially calibrates the meta-classifier instead of aligning it with the EEFP target. The results demonstrate that the improvements provided by our approach are not due to the larger number of parameters of the MLP that we use in comparison to calibration methods. While the CCCT variant is effective in calibrating the exits, the overall performance of the model is inferior to that of the one where EEFP is optimized instead.
Appendix F Robustness Analysis
To assess the robustness of our approach, we evaluate the models on DomainNet (Peng et al., 2019), which contains the same set of classes across six different domains. We train the model on the real domain and evaluate it across each domain to assess in-distribution versus out-of-distribution performance. The results are presented in Table 10. Our approach is robust to distributional shifts and achieves better scores than calibration in 10 out of 15 cases. We also observe that: (1) temperature scaling actually decreases ECE on unseen domains, and (2) EEFP still correlates well with the overall cost-accuracy trade-off.
Appendix G Generalization to Other Early-Exit Methods
In this section, we extend our evaluation to include other early-exit systems. Our failure prediction-based method can be applied to any early-exit method that uses classifier confidence for exit decisions. Since our approach is complementary to these methods, we apply it to each of these models to check whether the performance improvements that we observed in the main paper persist. In Table 11 we report the results of our method applied to Gradient Equilibrium (Li et al., 2019), Zero Time Waste (Wójcik et al., 2023), and early-exit distillation (Phuong and Lampert, 2019). Our approach provides consistent gains in the overall performance of the network in all cases. Crucially, our method also consistently achieves better EEFP scores while being miscalibrated compared to temperature scaling. This confirms our hypothesis and strengthens the main results of our work.
Appendix H Contributions
Piotr Kubaty led the empirical investigation, conducting the majority of the experiments and generating the data visualizations. He conceptualized the architectural refinements for the meta-classifier, specifically the top-k selection mechanism. He also made significant contributions to the manuscript.
Filip Szatkowski served as the primary coordinator for the writing process. He managed the structural development of the manuscript across multiple versions and was responsible for ensuring narrative clarity and logical flow throughout the paper.
Grzegorz Choczyński conducted the experimental evaluations demonstrating the generalization of the proposed approach to other early-exit frameworks (Table 11). He also performed several additional experiments that were not included in the final manuscript.
Eric Nalisnick developed the theoretical framework of the paper. He authored the formal analysis regarding the interplay between calibration, EENN triviality, and the conditional anytime property (Section 2.2 and Section 3) and provided crucial refinements to the final manuscript.
Bartosz Wójcik conceptualized the core research direction and supervised the project. He identified the fundamental disconnect between local exit calibration and global EENN performance, explicitly drawing the connection to the failure prediction literature. He defined the optimal stopping decision for EENNs, formulated the Early-Exit Failure Prediction (EEFP) score, and proposed the meta-classifier approach that directly optimizes for this target.
References
- C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez (2019). Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems (NeurIPS).
- P. Dawid (1982). The well-calibrated Bayesian. Journal of the American Statistical Association 77(379), pp. 605–610.
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009). ImageNet: a large-scale hierarchical image database. In CVPR.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021). An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR).
- B. Fang, X. Zeng, F. Zhang, H. Xu, and M. Zhang (2020). Flexdnn: Input-adaptive on-device deep learning for efficient mobile vision. In 2020 IEEE/ACM Symposium on Edge Computing (SEC).
- A. Ghodrati, B. E. Bejnordi, and A. Habibian (2021). Frameexit: Conditional early exiting for efficient video recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
- A. Hájek (2007). The reference class problem is your problem too. Synthese 156, pp. 563–585.
