JUCAL: Jointly Calibrating Aleatoric and Epistemic Uncertainty in Classification Tasks

Jakob Heiss; Sören Lambrecht; Jakob Weissteiner; Hanna Wutte; Žan Žurič; Josef Teichmann; Bin Yu

arXiv:2602.20153·stat.ML·February 24, 2026

Open Access · 3 Reviews

TL;DR

JUCAL is a new calibration method that jointly adjusts aleatoric and epistemic uncertainties in ensemble classifiers, significantly improving predictive uncertainty estimates and reducing inference costs.

Contribution

We introduce JUCAL, a simple calibration algorithm that jointly calibrates aleatoric and epistemic uncertainties for any trained ensemble, outperforming state-of-the-art methods.

Findings

01 · JUCAL reduces negative log-likelihood by up to 15%.

02 · JUCAL decreases predictive set size by up to 20%.

03 · JUCAL enables smaller ensembles to outperform larger temperature-scaled ensembles.
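
The two metrics named in the findings can be illustrated with a minimal sketch. It assumes NLL is the mean negative log-probability of the true label and that a predictive set is the smallest group of top-ranked classes whose cumulative probability reaches a target coverage. The helper names `nll` and `prediction_set_sizes` and the `coverage` parameter are hypothetical, and the paper may define set size differently.

```python
import numpy as np

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels.

    probs:  (N, K) array of predicted class probabilities.
    labels: (N,) array of integer class indices.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def prediction_set_sizes(probs, coverage=0.9):
    """Per-example size of the smallest top-probability set reaching `coverage`.

    Assumption: sets are built greedily from the most probable class down.
    """
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]   # descending per row
    cum = np.cumsum(sorted_probs, axis=1)
    # Count classes needed before the cumulative mass reaches the target;
    # the small tolerance guards against floating-point round-off.
    sizes = (cum < coverage - 1e-12).sum(axis=1) + 1
    return np.minimum(sizes, probs.shape[1])

probs = np.array([[0.7, 0.2, 0.1],
                  [0.95, 0.03, 0.02]])
labels = np.array([0, 0])
print(nll(probs, labels), prediction_set_sizes(probs))
```

Under this definition, sharper probabilities on easy inputs shrink the set, while a flat distribution forces more classes into it, which is why set size serves as a proxy for the quality of the uncertainty estimate.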

Abstract

We study post-calibration uncertainty for trained ensembles of classifiers. Specifically, we consider both aleatoric (label noise) and epistemic (model) uncertainty. Among the most widely used calibration methods in classification are temperature scaling (i.e., pool-then-calibrate) and conformal methods. However, the main shortcoming of these calibration methods is that they do not balance the proportion of aleatoric and epistemic uncertainty. Failing to balance these uncertainties can severely misrepresent predictive uncertainty, leading to overconfident predictions in some input regions and underconfident predictions in others. To address this shortcoming, we present a simple but powerful calibration algorithm, Joint Uncertainty Calibration (JUCAL), that jointly calibrates aleatoric and epistemic uncertainty. JUCAL jointly calibrates two constants to weight and scale epistemic and…
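
To make the idea of calibrating two constants on a trained ensemble concrete, here is a hedged sketch. The abstract is cut off before it states the parametrization, so the form below is an assumption and not the authors' formula: `c1` acts as a shared temperature on every member's logits, and `c2` stretches or shrinks each member's deviation from the ensemble-mean logits. It was chosen because `c2 = 1` reduces to calibrate-then-pool, which matches Reviewer 01's remark, while `c2 = 0` reduces to temperature-scaling the averaged logits. The function names and the grid search are hypothetical.

```python
import itertools
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def two_constant_predict(logits, c1, c2):
    """Pooled probabilities under an ASSUMED two-constant parametrization.

    logits: (M, N, K) array for M ensemble members, N inputs, K classes.
    c1: temperature shared by all members (assumed role: aleatoric scale).
    c2: multiplier on member disagreement (assumed role: epistemic scale).
    """
    mean = logits.mean(axis=0, keepdims=True)
    adjusted = (mean + c2 * (logits - mean)) / c1
    return softmax(adjusted).mean(axis=0)

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_two_constants(logits, labels, c1_grid, c2_grid):
    """Pick (c1, c2) minimizing NLL on held-out calibration data via grid search."""
    candidates = itertools.product(c1_grid, c2_grid)
    return min(candidates,
               key=lambda c: nll(two_constant_predict(logits, *c), labels))

# Synthetic stand-in for calibration-set logits; not data from the paper.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 50, 3))
labels = rng.integers(0, 3, size=50)
grid = [0.5, 1.0, 2.0]
c1, c2 = fit_two_constants(logits, labels, grid, grid)
```

Both constants are fitted jointly on the same calibration set, so the procedure is post-hoc and needs no retraining of the ensemble members. A grid search is used here only for transparency; any optimizer over two scalars would serve.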

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The method applies temperature scaling in a new way, taking into account ensemble differences. The method is intuitive, simple to understand and can be applied post-hoc to already trained ensembles. This makes the method easy to use.

Weaknesses

The experiments could include other post-hoc calibration methods for comparison. Calibrate-then-pool could also be added to the comparison, since it is the special case of JUCAL when $c_2 = 1$. Section 5.1 briefly mentions that ensemble members are based on pretrained architectures like GPT-2, BERT, and T5, but it remains ambiguous whether the ensembles consist of a single architecture or multiple architectures. Since the nature of the ensemble affects both uncertainty decomposition and calibrati

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper is generally well written.
- JUCAL is conceptually simple, easy to implement, and can be applied to off-the-shelf ensembles without model re-training.
- The experiments compare across ensemble sizes and demonstrate consistent improvements on NLL and predictive set size.

Weaknesses

**Lack of ablation studies on core claims**
The paper's main idea is that $c_1$ calibrates aleatoric uncertainty while $c_2$ calibrates epistemic uncertainty. However, no ablation studies are provided to validate this claimed separation. Without these ablations, the interpretation of $c_1$ and $c_2$ as targeting specific uncertainty types remains speculative.

**Lack of support for main claims**
The conclusion states: “our approach provides a principled and impactful advance in uncertainty calibration for

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* The paper addresses the important and underexplored problem of balancing aleatoric and epistemic components during uncertainty calibration. Standard methods like temperature scaling treat predictive uncertainty as a single quantity, and the idea of disentangling these sources for more granular control is well-motivated.
* The proposed method, JUCAL, is simple, intuitive, and computationally inexpensive.
* The experimental results on several text classification tasks are promising.

Weaknesses

My main concerns are regarding the strength of the paper's core claims, the limited experimental scope, and the lack of crucial baselines.

* **Overstated Disentanglement Claims:** The central claim of "jointly calibrating aleatoric and epistemic uncertainty" seems overstated and is not supported by sufficient experimental evidence. The primary evidence for successful disentanglement is presented in Figure 5, which shows that the estimated epistemic uncertainty (EU) decreases with more trainin

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification