ImmUQBench: a benchmark on uncertainty quantification of protein immunogenicity prediction
Alif Bin Abdul Qayyum, Amir Hossein Rahmati, Xiaoning Qian, Byung-Jun Yoon

TL;DR
This paper introduces ImmUQBench, a benchmark for evaluating uncertainty quantification methods in predicting protein immunogenicity, aiming to improve the reliability of AI/ML models in therapeutic antigen design.
Contribution
The novel contribution is the creation of ImmUQBench, a standardized benchmark for assessing UQ methods in immunogenicity prediction under data-scarce conditions.
Findings
Different UQ strategies show varying abilities in capturing predictive uncertainty and maintaining robustness.
ImmUQBench provides a standardized evaluation framework for model accuracy, calibration, and robustness.
The benchmark helps identify the most trustworthy UQ methods for immunogenicity prediction.
Abstract
Discovering antigen proteins, capable of eliciting desired immune responses, is of paramount importance in developing immunogenic therapeutics for combating various diseases, particularly autoimmune disorders, infectious diseases, as well as cancers. Despite recent advances in artificial intelligence (AI) and machine learning (ML), accurate and generalizable immunogenicity prediction remains challenging due to limited labeled data and model over-simplifications. Uncertainty quantification (UQ) approaches are commonly used to address the aforementioned challenges when applying AI/ML methods with limited training data, aiming to reduce the risk of catastrophic errors. This study aims to systematically evaluate the performance of UQ methods for antigen immunogenicity prediction and to establish a benchmark for assessing model reliability in data-scarce setting. We here present ImmUQBench,…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11| Dataset | Model | ESMC | ProstT5 | Ankh | ESM2 | Prot Bert |
|---|---|---|---|---|---|---|
| Virus | Deterministic | 0.0344 | 0.0457 | 0.0442 | 0.0313 | 0.0513 |
| TS | 0.0354 | 0.0420 | 0.0390 |
| 0.0497 | |
| LA |
| 0.0382 |
| 0.0578 | 0.0451 | |
| DVBLL | 0.0514 | 0.0550 | 0.0450 | 0.0656 |
| |
| EDL | 0.1385 | 0.0657 | 0.1063 | 0.1539 | 0.1031 | |
| SWAG | 0.1313 | 0.0993 | 0.1577 | 0.1439 | 0.1157 | |
| MCD | 0.0344 | 0.0454 | 0.0449 | 0.0316 | 0.0492 | |
| DKL | 0.0398 | 0.0820 | 0.0839 | 0.0445 | 0.0926 | |
| SGLD | 0.0341 |
| 0.0442 | 0.0559 | 0.0487 | |
| Bacteria | Deterministic | 0.0957 | 0.1101 | 0.1102 | 0.0696 | 0.0876 |
| TS | 0.0862 | 0.0966 | 0.0954 | 0.0569 | 0.0806 | |
| LA | 0.1049 | 0.0954 | 0.0976 | 0.0994 | 0.1002 | |
| DVBLL | 0.1105 | 0.1033 | 0.0694 | 0.0963 | 0.0764 | |
| EDL |
| 0.0977 | 0.1202 | 0.1083 | 0.1016 | |
| SWAG | 0.1083 | 0.0966 | 0.1250 | 0.0886 | 0.1127 | |
| MCD | 0.0956 | 0.1086 | 0.1085 | 0.0668 | 0.0860 | |
| DKL | 0.0862 | 0.0553 | 0.1098 | 0.1139 | 0.0723 | |
| SGLD | 0.1256 |
|
|
|
| |
| Tumor | Deterministic | 0.0753 | 0.1796 | 0.0994 | 0.1217 | 0.1532 |
| TS | 0.0781 | 0.1644 | 0.1000 | 0.0952 | 0.1421 | |
| LA | 0.0979 | 0.1689 | 0.1805 | 0.1233 | 0.1131 | |
| DVBLL | 0.1333 | 0.2191 | 0.1314 | 0.1297 | 0.1420 | |
| EDL | 0.1219 |
| 0.1164 | 0.0832 |
| |
| SWAG | 0.1275 | 0.1196 | 0.1265 | 0.1252 | 0.1191 | |
| MCD |
| 0.1792 | 0.1033 | 0.1140 | 0.1541 | |
| DKL | 0.1135 | 0.1611 | 0.1194 | 0.1516 | 0.1341 | |
| SGLD | 0.2037 | 0.0848 |
|
| 0.0871 |
| Dataset | Model | ESMC | ProstT5 | Ankh | ESM2 | Prot Bert |
|---|---|---|---|---|---|---|
| Virus | Deterministic | 0.2559 | 0.3066 | 0.3046 | 0.2616 | 0.2886 |
| TS | 0.2525 | 0.2923 | 0.2889 |
| 0.2819 | |
| LA |
| 0.2877 | 0.2790 | 0.2951 | 0.2909 | |
| DVBLL | 0.2915 | 0.3207 | 0.4686 | 0.3474 | 0.2873 | |
| EDL | 0.3825 | 0.2771 | 0.3024 | 0.3794 | 0.3043 | |
| SWAG | 0.3189 | 0.3054 | 0.3600 | 0.3304 | 0.3129 | |
| MCD | 0.2549 | 0.3017 | 0.2995 | 0.2612 | 0.2868 | |
| DKL | 0.3039 | 0.3984 | 0.3670 | 0.2937 | 0.3644 | |
| SGLD | 0.2855 |
|
| 0.2829 |
| |
| Bacteria | Deterministic | 0.4961 | 0.6113 | 0.5823 | 0.4501 | 0.5059 |
| TS | 0.4784 | 0.5545 | 0.5268 | 0.4375 | 0.4807 | |
| LA | 0.5780 | 0.5275 | 0.5599 | 0.5091 | 0.5550 | |
| DVBLL | 0.5663 | 0.6550 | 0.6513 | 0.5986 | 0.5104 | |
| EDL |
| 0.4917 | 0.5189 | 0.5436 | 0.4599 | |
| SWAG | 0.4950 | 0.5153 | 0.4980 | 0.4705 | 0.4697 | |
| MCD | 0.4932 | 0.5968 | 0.5725 | 0.4484 | 0.5000 | |
| DKL | 0.5011 | 0.5074 | 0.5370 | 0.5357 | 0.4797 | |
| SGLD | 0.6303 |
|
|
|
| |
| Tumor | Deterministic | 0.4930 | 1.0011 | 0.5210 | 0.5398 | 0.6798 |
| TS | 0.4919 | 0.8615 | 0.5042 |
| 0.6256 | |
| LA |
| 0.7930 | 0.7586 | 0.6226 | 0.5894 | |
| DVBLL | 0.5827 | 0.9336 | 0.6086 | 0.6288 | 0.6521 | |
| EDL | 0.5244 | 0.6015 | 0.6289 | 0.5639 | 0.5975 | |
| SWAG | 0.5070 | 0.5694 | 0.6111 | 0.5556 | 0.5669 | |
| MCD | 0.4932 | 0.9715 | 0.5175 | 0.5311 | 0.6710 | |
| DKL | 0.5526 | 0.6982 | 0.6275 | 0.5972 | 0.5727 | |
| SGLD | 0.8403 |
|
| 0.5275 |
|
| Dataset | Model | ESMC | ProstT5 | Ankh | ESM2 | Prot Bert |
|---|---|---|---|---|---|---|
| Virus | Deterministic | 0.0746 | 0.0852 | 0.0844 | 0.0796 | 0.0821 |
| TS | 0.0743 | 0.0842 | 0.0833 |
| 0.0822 | |
| LA |
| 0.0845 | 0.0855 | 0.0852 | 0.0823 | |
| DVBLL | 0.0775 | 0.0876 | 0.1504 | 0.1100 | 0.0792 | |
| EDL | 0.1154 |
| 0.0824 | 0.1125 | 0.0845 | |
| SWAG | 0.0914 | 0.0852 | 0.1078 | 0.0962 | 0.0911 | |
| MCD | 0.0745 | 0.0849 | 0.0842 | 0.0796 | 0.0821 | |
| DKL | 0.0892 | 0.1232 | 0.1174 | 0.0853 | 0.1096 | |
| SGLD | 0.0852 | 0.0836 |
| 0.0854 |
| |
| Bacteria | Deterministic | 0.1503 | 0.1642 | 0.1521 | 0.1403 | 0.1510 |
| TS | 0.1484 | 0.1601 | 0.1483 | 0.1383 | 0.1485 | |
| LA | 0.1487 | 0.1656 | 0.1565 | 0.1499 | 0.1509 | |
| DVBLL | 0.1587 | 0.1685 | 0.2024 | 0.1480 | 0.1424 | |
| EDL |
| 0.1597 | 0.1696 | 0.1840 | 0.1472 | |
| SWAG | 0.1376 |
| 0.1488 |
| 0.1447 | |
| MCD | 0.1502 | 0.1636 | 0.1515 | 0.1402 | 0.1506 | |
| DKL | 0.1546 | 0.1653 | 0.1592 | 0.1601 | 0.1488 | |
| SGLD | 0.1623 | 0.1473 |
| 0.1373 |
| |
| Tumor | Deterministic | 0.1675 | 0.2255 | 0.1687 | 0.1703 | 0.2040 |
| TS | 0.1674 | 0.2197 | 0.1661 |
| 0.1981 | |
| LA |
| 0.2221 | 0.2006 | 0.1841 | 0.1892 | |
| DVBLL | 0.1753 | 0.2441 | 0.1925 | 0.1859 | 0.2014 | |
| EDL | 0.1727 | 0.2069 | 0.2192 | 0.1909 | 0.2054 | |
| SWAG | 0.1652 | 0.1880 | 0.1809 | 0.1827 |
| |
| MCD | 0.1676 | 0.2246 | 0.1680 | 0.1698 | 0.2031 | |
| DKL | 0.1800 | 0.2232 | 0.2052 | 0.2043 | 0.1931 | |
| SGLD | 0.2379 |
|
| 0.1806 | 0.1833 |
- —Advanced Research Projects Agency for Health10.13039/100023015
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Topicsvaccines and immunoinformatics approaches · Machine Learning in Bioinformatics · Immunotherapy and Immune Responses
Introduction
Immunogenicity refers to the ability of a pathogen to provoke host immune responses. Identifying which pathogen proteins are likely to trigger an immune response is an absolutely vital step when developing new protein-based immunogenic therapeutics, e.g. vaccines [1, 2]. This process, often called immunogenicity prediction, is a key task that helps scientists anticipate and manage potential problems before a therapeutic even reaches clinical trials. Deep learning models have significantly advanced this task by enabling scalable and accurate predictions [3, 4]. However, their effectiveness is often hindered by the scarcity of labeled data and a mismatch between the task complexity and model assumptions, leading to suboptimal performance and limited generalizability, particularly when designing for broad viral efficacy.
The advent of large language models (LLMs) marks a pivotal advancement in natural language processing (NLP), fundamentally reshaping its capabilities [5–7]. This progress has, in turn, facilitated the emergence of general-purpose computational tools within the field of biology. In particular, the adaptation of language modeling techniques to proteins has led to the emergence of powerful protein language models (PLMs), which have demonstrated strong performance on a variety of downstream tasks [8–10]. These models frequently surpass traditional approaches and offer improved generalization capabilities.
Despite their empirical success, both traditional ML prediction models and PLM integrated DL models designed for downstream tasks often exhibit overconfident predictions and are prone to generating hallucinated outputs [11, 12], raising concerns about their reliability and trustworthiness in sensitive applications, e.g. safety and efficacy related therapeutic design. To mitigate these limitations, the machine learning community has increasingly turned to uncertainty quantification (UQ) techniques. Broadly, UQ methods fall into two categories: Bayesian approaches [13–15], which provide a principled probabilistic framework but can be computationally intensive or impractical, and non-Bayesian approaches [16–18], which are often more tractable and performant but lack strong theoretical guarantees.
Despite the increasing importance of UQ in immunogenicity prediction, there is no comprehensive study comparing different UQ techniques for varied downstream applications. In this work, we fill this void by introducing ImmUQBench, a benchmark designed to systematically evaluate a diverse set of UQ methods in the context of immunogenicity prediction. We compare both Bayesian and non-Bayesian methods across various experimental scenarios, including in-distribution and out-of-distribution settings, to assess their predictive accuracy, uncertainty estimation quality, and robustness to distributional shifts. Additionally, we examine the impact of different protein sequence encoding schemes, highlighting robustness to alternate encodings that express the same underlying sequence information. To ensure a concise and focused presentation, all results in the main manuscript, unless otherwise specified are obtained using ESM-Cambrian [9] as the PLM. We adopt ESM-Cambrian because it demonstrates an overall superior empirical predictive performance and represents the most recent generation among the PLMs considered. Results based on the other PLMs are provided in the supplementary material. Figure 1 provides a detailed illustration of the ImmUQBench framework. Throughout all our experiments, although there is not a clear winner (as expected) among UQ techniques, almost all methods outperformed their deterministic alternative across nearly all metrics.
Benchmarking protein immunogenicity prediction. We adopt VenusVaccine [19] as the backbone architecture as an exemplar using both sequence and structure level representations, along with physico-chemical descriptors for immunogenicity prediction. We evaluate different uncertainty quantification strategies on the corresponding uncertainty-aware model variants according to both predictive and UQ metrics.
To the best of our knowledge, ImmUQBench is the first comprehensive benchmark to assess UQ methods for immunogenicity prediction on three distinct immunogenic protein data sources, an essential step in therapeutic design, including vaccine development. We briefly summarize our contributions below:
Pioneering Benchmark in Immunogenicity: We introduce ImmUQBench, a benchmark for uncertainty quantification in immunogenicity prediction. Extensive Evaluation of UQ across Several Data Distributions: We systematically evaluate a wide range of Bayesian and non-Bayesian UQ approaches on three distinct immunogenic data sources, both across in-distribution and out-of-distribution scenarios. Evaluation of Various Data Representation and Model Ablation: We provide insights through extensive experiments and ablation studies to support antigen design and broadly effective therapeutic development.
The remainder of this paper is organized as follows: Section UQ for Deep Neural Networks reviews existing UQ approaches for deep neural network models; Section ImmUQBench presents the setup of ImmUQBench; Section Experiments describes the experimental settings and results; Section Related Works reviews related works; and Section Conclusion & Limitations concludes the study.
UQ for deep neural networks
As Deep Learning models have become widely adopted across a broad range of tasks, their trustworthiness and reliability of their prediction have become essential and critically important, as they struggle to distinguish between in and out-of-distribution datasets [20, 21] as well as being sensitive to domain shift [22]. This is particularly important in safety critical tasks where data is often scarce and wrong predictions can lead to severe consequences. Hence, it is essential that these models express their uncertainty when confronting out-of-distribution data. Different approaches have been developed and utilized for uncertainty quantification that can be broadly categorized into Bayesian methods [13–15, 23–28] and non-Bayesian methods [16–18, 29–32].
In this work, for each category of UQ methods, we consider widely used and representative approaches.
Bayesian methods
We begin with Monte Carlo (MC)-Dropout [15], which interprets dropout as a form of approximate Bayesian inference in deep Gaussian processes. This interpretation allows dropout to capture epistemic uncertainty by maintaining stochasticity at test time. Specifically, model predictions are obtained by performing multiple stochastic forward passes with dropout enabled and computing the predictive mean and variance from these passes. Formally, the predictive mean is approximated by averaging outputs over T stochastic passes, while the variance is classically estimated as the sample variance augmented by a model precision term , where corresponds to the assumed observation noise precision that can be identified by dropout rate, dataset size, and weight decay. In this work, we omit the term and use the empirical variance of MC dropout predictions as a measure of model uncertainty. This enables tractable Bayesian approximation without incurring additional test-time complexity.
Next, we consider variational Bayesian last layer (VBLL) [33], which provides an efficient sampling-free approach to Bayesian modeling by maintaining a posterior only over the final layer of a neural network. By casting the training objective as a deterministic variational bound, VBLL introduces minimal overhead and is easily integrated into existing architectures, yielding principled uncertainty estimates without requiring stochastic forward passes.
We also include stochastic variational deep kernel learning (SVDKL) [34], which synergizes the representation power of deep neural networks with the non-parametric flexibility of Gaussian processes. SVDKL extends deep kernel learning (DKL) to classification and multi-task learning using a scalable variational inference framework. This allows for training on large-scale datasets via stochastic gradients, and supports more expressive covariance structures compared to prior DKL models.
SWAG [35] (stochastic weight averaging-Gaussian) is another Bayesian approximation technique that constructs a Gaussian posterior over network weights by leveraging the trajectory of stochastic gradient descent (SGD). It estimates the posterior mean via the running average of SGD iterates and approximates the covariance using both a low-rank approximation based on recent deviations from the mean and a diagonal component derived from the second moment. This results in a scalable approach to uncertainty estimation that enables Bayesian model averaging through weight sampling at test time.
Stochastic gradient Langevin dynamics (SGLD) [36] is another scalable Bayesian learning algorithm designed for large datasets, combining stochastic optimization with principles from Langevin dynamics. At each iteration, it performs a gradient-based update similar to stochastic gradient descent (SGD) but injects carefully calibrated Gaussian noise into the parameter updates. Under appropriate conditions on the step size and noise variance, the sequence of parameter samples produced by SGLD asymptotically follows the true Bayesian posterior, allowing for uncertainty quantification without the heavy computational cost of traditional Markov chain Monte Carlo methods.
Finally, we use the Laplace approximation (LA) [26], which approximates the posterior distribution over model weights with a Gaussian centered at the maximum a posteriori (MAP) estimate. This is achieved by performing a second-order Taylor expansion of the log-posterior, resulting in a Gaussian with covariance given by the inverse Hessian of the log-posterior evaluated at the MAP point. Formally, the posterior is approximated as , where is the Hessian of the negative log posterior and is the MAP. This method provides a fast and principled estimate of uncertainty without requiring sampling during inference.
Non-Bayesian methods
Among non-Bayesian approaches, we consider Deep Ensembles [30], which train an ensemble of M neural networks with different initializations. Each model outputs a probabilistic prediction, and the ensemble prediction is computed by averaging these outputs. This ensemble captures both model and data uncertainty and has been shown to outperform many Bayesian approximations in terms of calibration and robustness. Each network is typically trained using proper scoring rules such as negative log-likelihood to ensure meaningful probabilistic outputs.
We also evaluate evidential deep learning (EDL) [37], which explicitly models predictive uncertainty by placing a Dirichlet distribution over class probabilities. Rather than producing point estimates via softmax, the network outputs non-negative “evidence” values that parameterize the Dirichlet. This allows the model to represent both aleatoric and epistemic uncertainty in a unified framework. The loss function combines the Bayes risk (under an L2 norm) with a KL-divergence term that regularizes the model to prevent overconfident predictions, enabling uncertainty-aware classification from a single forward pass.
Finally, as all the aforementioned methods ultimately aim to produce well-calibrated and reliable predictive distributions, we also include Temperature Scaling (TS) [20] as a baseline for comparison. Temperature scaling is a post hoc calibration technique that adjusts the confidence of a classifier by optimizing a single non-negative scalar parameter on a validation set, typically using negative log-likelihood as the objective. By dividing the logits by T before applying the softmax, this method effectively increases the entropy of the predictive distribution, leading to better-calibrated output probabilities without altering the model’s accuracy.
ImmUQBench
In this work, we focus on investigating UQ approaches in identifying whether proteins—originating from humans, bacteria, or viruses—are immunogenic. This task can be cast as a binary classification problem, where the model is trained to predict whether a given protein (or peptide segment) is an immunogenic antigen: an immunogenic antigen is an antigen that is capable of eliciting immune response when exposed to human immune system.
Immunogenicity
Immunogenicity is linked to the therapeutic use of proteins and can result in serious clinical outcomes, including reduced treatment effectiveness or potentially life-threatening complications. Naturally, determining the cause of immunogenicity in biologic therapies is a necessary pursuit [38]. Particularly, immunogenicity prediction has become a central component in reverse vaccinology aiming to identify antigens that are capable of eliciting immune responses resulting in the formation of memory cells within the host organism [2, 19].
Researchers are increasingly focused on fast and precise prediction of immunogenic antigens for vaccine development, as this approach minimizes costs and associated risks, while supporting safe and effective responses to infectious disease threats. [39] use a simple linear scoring function to calculate immunogenicity score. DeepImmuno [3] introduces two deep learning models aimed at modeling T-cell immunity, which is crucial for the development of cancer immunotherapies and vaccines. Specifically, DeepImmuno-CNN predicts immunogenicity, while DeepImmuno-GAN generates immunogenic peptides. TRAP [40] presents a robust deep learning framework for predicting CD8^+^ T-cell epitopes from both pathogenic and self-peptides. It also estimates the immunogenic potential of MHC-I peptides by providing a prediction score along with a confidence measure. Some current methods also consider using physiochemical properties of amino acids for immunogenicity prediction.
As our core objective is to integrate and evaluate different UQ approaches for immunogenicity prediction, we adopt VenusVaccine [19], a cutting-edge multi-modal deep learning framework. Leveraging a dual-attention mechanism, VenusVaccine integrates sequence, structural, and physicochemical information to effectively interpret immunogenicity.
Protein language models (PLMs)
The adaptation of LLMs—the advent of which marked a major shift in natural language processing (NLP)—to protein sequences has resulted in the emergence of advanced protein language models (PLMs) [10, 41, 42]. This adaptation—hence modeling of protein sequences—was enabled by equating words with amino acids and interpreting the entire protein sequences as sentences [43, 44]. Via self-supervised learning, generic PLMs are often pre-trained on large datasets of amino acid sequences, which then due to learning contextual residue representations [41, 42], they can serve as feature extractors for a wide-range of protein tasks, such as prediction of structure, binding residues, sub-cellular localization, and fold classification.
Problem setup
Here, we formalize the problem, incorporating multi-modal information from protein sequences, structures, and physicochemical properties. Following [19], sequence and structure embeddings are extracted from pre-trained protein language models (PLMs). These embeddings are passed through the dual-attention module of VenusVaccine, which summarizes them into a unified representation:
where
denote the sequence and structure embeddings of the amino acid sequence of length L, respectively, and d is the embedding dimension. The attention output H, along with the sequence and physicochemical features, is concatenated and passed to a classifier:
where is a deterministic classifier parameterized by .
Bayesian Methods: In this work, for evaluating Bayesian methods, we treat as a random variable to enable uncertainty estimation and to evaluate the performance of different uncertainty quantification methods. Thus, the predictive distribution is given by:
In ImmUQBench, we have implemented MC-Dropout [15], SWAG [35], DVBLL [33], SVDKL [34], and LA [26].
As non-Bayesian methods employ varied and often method-specific mechanisms for uncertainty estimation, a general predictive formulation analogous to the Bayesian case is not readily available. Hence, we briefly outline the evaluation formulation for Deep Ensembles and EDL. In addition, we describe TS which is a widely-used calibration technique for adjusting predicted probabilities.
Deep Ensemble: By training M neural networks independently, we estimate the uncertainty. Specifically, each model outputs a prediction, and the ensemble predictive distribution is computed as the average,
The diversity among the members captures the uncertainty. Instead of training the same model with different initialization or data shuffling, we employed different data representation for different models in the ensemble. Each ensemble consists of five models, each with similar architecture following VenusVaccine [19]. However, the amino acid sequence level encoding, , for different model in the ensemble comes from different PLMs. The 5 PLMs used in this work are: ESM-Cambrian [9], ProstT5 [45], Ankh [46], ESM-2 [47], and Prot-Bert [48].
EDL: In a binary classification, EDL models the class probability as a Beta distribution, . Considering the network outputs the evidence parameters, , where , the predictive probability for class i is given by,
Uncertainty is then captured through the variance of the Beta distribution.
TS: To improve the calibration of predicted probabilities, TS introduces a scalar temperature parameter , which is optimized on a validation set by minimizing the negative log-likelihood. This adjustment rescales the logit outputs to produce softer probability distributions with . That is, larger values of T spread probability mass more evenly across classes, increasing the entropy. Specifically, given the logit vector , the calibrated probabilities are computed as:
where denotes the softmax function. TS adjusts confidence levels without affecting the model’s accuracy, making it a simple yet effective post hoc calibration method.
Experiments
Uncertainty evaluation metrics
We assess uncertainty quantification using three established metrics: Expected Calibration Error (ECE), negative log-likelihood (NLL), and the Brier score, which have been commonly employed in the literature. ECE and Brier scores are considered as calibration metrics while NLL is mostly regarded as an indicator of overconfidence.
A calibrated model is the one whose predicted probabilities match the empirical frequency of the output [49]. A well-calibrated model can prevent wrong decisions in case of high uncertainty. ECE is used to assess calibration. Particularly by partitioning predictions into M equally-spaced bins based on their prediction confidence, ECE can be calculated as [20, 49],
with N indicating the size of the dataset, and and the average accuracy and confidence in bin with size respectively.
Calibration can also be evaluated by the Brier score, which is a proper scoring rule and a widely accepted tool in the context of uncertainty quantification due to its ability in assessing the quality of probabilistic predictions [49, 50]. Especially, it captures how correct a model is and if it expresses proper confidence levels, by measuring mean squared difference between predicted probabilities and predictions. For a binary class, it is,
On the other hand, NLL is often used to detect overconfidence. It is computed as the negative log-probability assigned to the true label,
When a model is overconfident in an incorrect prediction, it assigns a high probability to the wrong class. Therefore, the log loss becomes very large, that results in a high NLL.
Dataset
In this study, we use ImmunoDB, an immunogenicity database comprising 7216 labeled antigens derived from three distinct sources: bacteria, viruses, and humans [19]. Each antigen is labeled as either immunogenic (positive) or non-immunogenic (negative). The dataset is constructed through a combination of literature curation, database mining, and bioinformatics filtering, with most positive samples originating from previously published studies. To ensure quality, redundant sequences and samples from tail regions were filtered out. This process resulted in three curated subsets: Immuno-Virus, Immuno-Bacteria, and Immuno-Tumor. Owing to its rigorous quality control, diverse species coverage, and comprehensive sourcing, ImmunoDB provides a valuable benchmark for evaluating the robustness and generalizability of immunogenicity prediction models. To the best of our knowledge, at the time this study was conducted, it represents the most extensive labeled antigen resource available for this task. For the sake of simplicity, we will interchangeably use Immuno-Virus and Virus for the rest of this paper (similar policy for Bacteria and Tumor).
Backbone architecture
In this work, we use VenusVaccine [19], a supervised deep learning model for immunogenicity prediction, as the backbone in our experiments. The model integrates sequence, structural, and physicochemical information using a dual attention mechanism. It encodes protein sequences with pretrained PLMs and represents structures at both atomic and peptide levels using FoldSeek [51] and ESM-3 [52], respectively. Handcrafted physicochemical descriptors are also included to enhance biological relevance.
The model employs a hierarchical cross-attention framework that fuses sequence and structure representations at multiple scales, enabling rich interaction across modalities. Attention pooling then compresses amino acid-level features into a protein-level vector by highlighting key regions, which is used for final binary classification of immunogenicity.
Experimental settings
Unless otherwise specified, all reported results are performance statistics over five independent runs. For training the models, we followed the similar techniques and hyperparameters adopted in VenusVaccine [19]. For DVBLL, LA, and SVDKL, we modified the last MLP segment of the original VenusVaccine architecture by adding an extra linear layer and converted this extra linear layer as the probabilistic segment. This was aimed at promoting stable training while maintaining reasonable computational costs. The additional linear layer has dimension 64.
For BNNs, we obtained 64 MC sample predictions. All reported results in this work, except [Tables 1–3](#iqag003-T1 iqag003-T2 iqag003-T3), utilize protein sequence embeddings derived from the ESM-Cambrian protein language model [9]. It should be emphasized that the ensemble model also uses sequence embeddings extracted from all five different PLMs including ESM-Cambrian.
ID (in-distribution) results
In-distribution evaluation refers to the scenario where the train and test sets both originate from the same immunogenic data source, e.g. virus, bacteria or tumor (test data source = train data source). [Figures 2–4](#iqag003-F2 iqag003-F3 iqag003-F4) rank the performance of different models according the performance across different predictive and UQ metrics.
In-Distribution results for Immuno-Virus dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green and red, respectively.
In-Distribution results for Immuno-Bacteria dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
In-Distribution results for Immuno-Tumor dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green and red, respectively.
For each of the three immunogenic datasets, generally UQ methods showed superior performances compared to the deterministic model across nearly all performance and uncertainty metrics. The superior performance is more obvious in Bacteria dataset, underscoring the benefits of uncertainty-aware modeling. A more detailed explanation is provided in the following.
For Immuno-Virus dataset, the Ensemble model outperforms other models in terms of all metrics except Recall and ECE where SWAG and LA yield the best performance and calibrated predictions respectively.
In evaluating the Immuno-Bacteria dataset, SWAG stands out for its dominant predictive performance, achieving the best results across all relevant metrics. However, in terms of uncertainty quantification, the Ensemble model provides the most reliable uncertainty quantification, as reflected in its leading ECE, NLL, and Brier score results.
For the Immuno-Tumor dataset, SWAG, and Ensemble consistently demonstrate the most effective predictive performance, achieving the highest scores across all performance metrics. However, regarding uncertainty quantification, the Ensemble model, similar to Bacteria dataset, showed the superior performance over all uncertainty metrics.
Finally, we have examined the correlation between the squared error (the difference between the true label and the predicted probability) and the predicted uncertainties for each probabilistic method. Figure 5 summarizes the Spearman correlations across all models. Overall, predicted uncertainties are strongly correlated with prediction error for most methods and datasets, with the exception of the Tumor dataset, where SWAG and MCD exhibit noticeably weaker correlations. The Virus dataset shows the highest correlations across all models, indicating that uncertainty estimates are most reliable in this setting. On the Bacteria dataset, correlations remain high for nearly all methods (except MCD), but are consistently lower than those observed for Virus. The Tumor dataset yields the weakest correlations overall, highlighting a greater challenge in aligning uncertainty estimates with predictive accuracy. Among the methods, MCD is particularly sensitive to the dataset: while it achieves strong correlations on Virus, its performance degrades substantially on Tumor. This underscores how the quality of the posterior approximation can depend on the dataset, potentially due to the restricted expressiveness of the variational family. In contrast, LA, DVBLL, and EDL show a more consistent performance across all datasets. Interestingly, these methods consistently show outstanding predictive performance, but they do not necessarily achieve the best UQ performance as reported in [Figs. 2–4](#iqag003-F2 iqag003-F3 iqag003-F4).
Spearman correlation, ρ between squared error and the predicted uncertainty. In nearly all methods the predicted uncertainty has a high correlation with error, with DVBLL, EDL, and LA demonstrating a more consistent performance across datasets.
In summary, among all datasets, the Ensemble model consistently showed a superior uncertainty quantification capability, while having comparable performance to the best performing models. Also, over all metrics, probabilistic models outperformed their deterministic alternative as expected.
OOD (out-of-distribution) results
Apart from the evaluation of models on test datasets, trained on respective data sources; we also report the out-of-distribution evaluation results based on the following three settings (test data source train data source). [Figures 6–11](#iqag003-F6 iqag003-F7 iqag003-F8 iqag003-F9 iqag003-F10 iqag003-F11) show the evaluative results for different OOD evaluative scenarios with visualizations which show the rank of different models based on their respective performance on that metric. Values displayed on top of each bar show the mean of values and the error bars show the range of values obtained from all the different experimental seeds.
Out-of-distribution results on Immuno-Bacteria dataset of models trained with Immuno-Virus dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
Out-of-distribution results on Immuno-Tumor dataset of models trained with Immuno-Virus dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
Out-of-distribution results on Immuno-Virus dataset of models trained with Immuno-Bacteria dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
Out-of-distribution results on Immuno-Tumor dataset of models trained with Immuno-Bacteria dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
Out-of-distribution results on Immuno-Virus dataset of models trained with Immuno-Tumor dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
Out-of-distribution results on Immuno-Bacteria dataset of models trained with Immuno-Tumor dataset. For the predictive metrics (accuracy, precision, recall, F1-Score, AUC ROC), the optimal performance indicates the higher the better; whereas for the UQ metrics (ECE, NLL, brier score), the optimal performance indicates the lower the better. The values on each bar plotted on all the subplots show the mean of the values, and the error bars show the range of values obtained from all the different experimental seeds. The deterministic, Bayesian, and non-Bayesian methods are plotted in blue, green, and red, respectively.
Generalization performance of models trained on immuno-virus dataset: Figs. 6 and 7 show the evaluation of models on Immuno-Bacteria and Immuno-Tumor datasets accordingly, while they were trained on Immuno-Virus dataset. Overall, the results underscore the superiority of probabilistic models, with most such methods achieving better scores than deterministic alternatives on almost all performance and uncertainty metrics. In detail, Ensemble and SGLD show superior predictive performance on Bacteria dataset under OOD setting, while EDL and Ensemble outperform other methods concerning the uncertainty quantification metrics. Interestingly, unlike the ID setting, Ensemble model lagged behind other methods in both predictive and uncertainty metrics on Tumor dataset under OOD setting, except for ECE, where it outperformed other methods. EDL shows strong uncertainty quantification depicted by its superior performance on uncertainty metrics, while with respect to predictive performance, EDL, SVDKL, and DVBLL outperform other methods depending on the metric. In summary, under OOD setting, when the models are trained on Immuno-Virus dataset, generally EDL, SWAG, and Ensemble show a superior performance to other models.
Generalization performance of models trained on immuno-bacteria dataset: Figs. 8 and 9 show the evaluation of models on Immuno-Virus and Immuno-Tumor datasets accordingly, while they were trained on Immuno-Bacteria dataset. In both datasets, EDL and Ensemble consistently show a superior performance over other methods. Particularly, in Virus dataset, EDL outperforms all other methods with respect to all performance and uncertainty metrics except for ECE where TS has the best performance. On Tumor dataset, Ensemble and EDL outperform all methods with respect to performance metrics, depending on the metric. Concerning the uncertainty quantification, Ensemble has the best uncertainty quantification performance, evidenced by its superior performance across all uncertainty metrics.
Generalization performance of models trained on immuno-tumor dataset: Figs. 10 and 11 show the evaluation of models on Immuno-Virus and Immuno-Bacteria datasets accordingly, while they were trained on Immuno-Tumor dataset. On both datasets, EDL consistently shows a superior uncertainty quantification performance depicted over all uncertainty metrics. However, concerning the predictive performance, it is more difficult to draw any conclusions about the best performing model. In detail, on Virus dataset, Ensemble, SWAG and EDL show a superior performance over other methods depending on the metric. On Bacteria dataset, SGLD more consistently show a better predictive performance, outperforming other methods with respect to most metrics but Accuracy and Recall, where EDL and SVDKL outperform other methods respectively.
Summary on OOD performance: in summary, Ensemble and EDL show a greater consistency and robustness in uncertainty quantification under OOD setting. Concerning the predictive performance, although EDL, SWAG, and Ensemble depict a superior performance on different datasets, there is no method clearly outperforming others under OOD setting. However, the clear conclusion to draw from these results is that the deterministic model is almost always outperformed by probabilistic methods, highlighting its limited generalizability and reliability.
PLM embedding comparison
Our ImmUQBench platform facilitates the evaluation and comparison of not only different UQ strategies, but also different model components for immunogenicity prediction. As part of this, we evaluate embeddings extracted from different protein language models that have been developed and improved by different research groups.
[Tables 1–3](#iqag003-T1 iqag003-T2 iqag003-T3) show the comparative performance among models with different PLMs for extracting amino acid sequence embeddings.
Across all three immunogenic datasets, and for nearly every performance metric considered, deterministic models consistently demonstrated inferior performance compared to other models. The only exception to this trend was observed on the Immuno-Virus dataset when protein sequences were embedded using the ESM-2 PLM, concerning the ECE and Brier score metric, and on the Immuno-Tumor dataset with ESM-2 embedded sequences, concerning NLL and Brier score. Among the various models investigated, SGLD consistently emerged as a strong performer. Plus SGLD, SWAG, EDL, and LA also exhibited competitive results, suggesting their potential for accurate immunogenicity prediction. To summarize, advantages over uncertainty-aware models over deterministic models persist irrespective of the protein language model used, underscoring compatibility with advances in protein representation learning.
Discussion
All the abovementioned experimental results consistently show that uncertainty-aware models outperform deterministic baselines in immunogenicity prediction, offering gains in both predictive accuracy and calibration. These improvements were observed across all in-distribution datasets and persisted across different protein language model embeddings, suggesting that the benefits of UQ are largely independent of the upstream sequence representation. Enhanced calibration, most notably achieved by SWAG, EDL, LA, and Ensemble, has particular relevance in high-stakes biomedical applications, where overconfident errors can result in costly experimental misallocation or safety risks.
Performance differences became more nuanced across the out-of-distribution evaluation scenarios. EDL and Ensemble demonstrated strong calibration robustness under distribution shift, whereas SWAG, SGLD, and Ensembles along EDL often achieved higher predictive performance in specific scenarios. This indicates a trade-off between reliability and sharpness that should be aligned with downstream objectives.
Overall, empirical findings establish UQ as a means of achieving more reliable and effective immunogenicity prediction and provide actionable guidance for selecting model–task configurations in both experimental and clinical settings.
Related works
Protein language models: Advances in deep learning and the emergence of large language models, including specialized protein language models (PLMs), have revolutionized computational biology by offering accurate, generalizable, and scalable solutions to complex downstream tasks such as vaccinology, drug discovery, immunogenicity prediction, and therapeutic design. DeepNetBim [4] employs a hybrid architecture combining convolutional neural networks with attention mechanisms to integrate sequence features and network centrality metrics for predicting HLA–peptide binding affinity and immunogenicity. ImmugenX [53] introduces a modular PLM-based pipeline to predict immunogenic CD8^+^ epitopes, a task central to personalized immunotherapy. DeepHLApan [54] uses bi-directional GRUs with attention to jointly model HLA–peptide binding and immunogenicity for neoantigen discovery. UnifyImmun [55] adopts a transformer-based framework with dual encoders and cross-attention to simultaneously model HLA–peptide and peptide–TCR interactions, offering improved generalization and interpretability.
Uncertainty quantification: Despite the success of deep learning and large language models (LLMs) across numerous domains, these models often suffer from overconfident predictions and, in the case of LLMs, hallucinations. This has motivated the development of uncertainty quantification (UQ) techniques to assess the reliability of model outputs. Subspace Inference [56] constructs Bayesian posteriors in low-dimensional subspaces of model parameters, enabling efficient inference and calibrated uncertainty estimates. Contextual Dropout [57] learns data-dependent dropout probabilities, offering both improved predictive performance and uncertainty estimation. Laplace-LoRA [58] applies Laplace approximation over low-rank adaptation parameters in a post hoc manner, allowing for efficient posterior estimation after fine-tuning. BLoB [59] formulates a Bayesian low-rank adaptation framework by jointly estimating mean and covariance during fine-tuning. Contextual LoRA [60] further extends this by incorporating contextual uncertainty modules that dynamically adjust aleatoric uncertainty on a per-sample basis.
Existing benchmarks: Several benchmark studies have been developed to evaluate the performance of PLMs across a wide range of biological tasks, offering insights into their generalization and transfer learning capabilities. However, few benchmarks have explicitly investigated their behavior under uncertainty or assessed their reliability in critical biomedical applications. PEER [8] provides a comprehensive multi-task evaluation framework across protein function, localization, structure, and molecular interaction tasks, comparing traditional methods and PLMs. PETA [61] evaluates 13 PLMs with varying vocabulary sizes and tokenization strategies across 15 downstream tasks, shedding light on the impact of subword versus. amino-acid-level tokenization on PLM performance. The authors of [62] benchmarked multiple UQ methods on protein fitness regression tasks under various distributional shifts, revealing key trade-offs between calibration, accuracy, and data efficiency.
Conclusion & limitations
Conclusion: In this study, we have introduced ImmUQBench, a new benchmark for evaluating a range of uncertainty quantification (UQ) methods on the task of immunogenicity prediction. This benchmark demonstrates that UQ methods deliver benefits extending beyond calibration, consistently enhancing predictive performance across diverse datasets and embedding strategies. By leveraging multi-modal information and incorporating it through a state-of-the-art backbone model, our benchmark enables comprehensive evaluation under both in-distribution and out-of-distribution settings. Our results demonstrate that most UQ methods consistently outperform the deterministic baseline across various metrics. In particular, Ensemble, EDL, SWAG, and Laplace Approximation (LA) exhibit superior performance in terms of predictive accuracy, uncertainty estimation, and generalization. While some UQ methods occasionally underperform relative to the deterministic model, overall, UQ-based approaches yield more robust and calibrated predictions, especially in out-of-distribution scenarios. ImmUQBench—to the best of our knowledge, the first UQ benchmark for immunogenicity prediction—offers a valuable resource for developing more reliable and uncertainty-aware antigen design tools.
Limitations & future work: Our proposed ImmUQBench provides a targeted evaluation of selected Bayesian and non-Bayesian UQ methods. However, it does not explore the broader design space, including variations in backbone architectures—e.g. PLGDL which integrates both Protein Language Models and Geometric Deep Learning models [63]—or uncertainty propagation beyond the prediction head—e.g. integrate uncertainty in protein representation [64] into prediction of protein properties. Also, the current benchmark only addresses epistemic uncertainty; the consideration of uncertainty within protein representations and its effect on immunogenicity prediction still remains an open research endeavor.
Supplementary Material
iqag003_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Doneva N , Doytchinova I, Dimitrov I. Predicting immunogenicity risk in biopharmaceuticals. Symmetry 2021;13:388.
- 2Adu-Bobie J , Capecchi B, Serruto D et al Two years into reverse vaccinology. Vaccine 2003;21:605–10.12531326 10.1016/s 0264-410x(02)00566-2 · doi ↗ · pubmed ↗
- 3Li G , Iyer B, Prasath VBS et al Deepimmuno: deep learning-empowered prediction and generation of immunogenic peptides for t-cell immunity. Brief Bioinform 2021;22:bbab 160.34009266 10.1093/bib/bbab 160PMC 8135853 · doi ↗ · pubmed ↗
- 4Yang X , Zhao L, Wei F, Li J. Deep Net Bim: deep learning model for predicting HLA-epitope interactions based on network analysis by harnessing binding and immunogenicity information. BMC Bioinform 2021;22:231.May
- 5Touvron H , Lavril T, Izacard G et al Llama: Open and efficient foundation language models, 2023.
- 6Devlin J , Chang M-W, Lee K et al Bert: pre-training of deep bidirectional transformers for language understanding, 2019.
- 7Brown TB , Mann B, Ryder N et al Language models are few-shot learners, 2020.
- 8Xu M , Zhang Z, Lu J et al Peer: a comprehensive and multi-task benchmark for protein sequence understanding, 2022.
