Probabilistic Robustness for Free? Revisiting Training via a Benchmark
Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

TL;DR
This paper introduces PRBench, a comprehensive benchmark for evaluating probabilistic robustness in deep learning, comparing various training methods and providing insights into their effectiveness and generalization capabilities.
Contribution
The paper presents PRBench, the first dedicated benchmark for probabilistic robustness, and offers a unified framework for comparing training methods and analyzing their generalization performance.
Findings
Adversarial training (AT) methods improve both AR and PR more broadly than PR-targeted training methods.
PR-targeted training methods achieve lower generalization error and higher clean accuracy than AT methods.
PRBench provides a comprehensive evaluation of robustness training methods across multiple datasets and models.
Abstract
Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce…
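To make the AR/PR distinction in the abstract concrete, the following is a minimal sketch of a Monte Carlo PR estimate. It is not the paper's evaluation protocol: the uniform noise inside an L-infinity ball, the toy classifier `toy_predict`, and the helper `estimate_pr` are all assumptions introduced here for illustration.

```python
import random


def estimate_pr(predict, x, label, eps, n_samples=1000, seed=0):
    """Monte Carlo estimate of P[predict(x + delta) == label].

    Assumption (not from the paper): delta is drawn uniformly from the
    L-infinity ball of radius eps around x.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(perturbed) == label:
            correct += 1
    return correct / n_samples


def toy_predict(x):
    """Hypothetical stand-in classifier: class 1 if the features sum above 0."""
    return 1 if sum(x) > 0 else 0


x, label = [0.3, 0.3], 1

# PR view: most random perturbations with eps = 0.5 keep the prediction
# correct (analytically, 92% of them for this toy setup).
pr_large_eps = estimate_pr(toy_predict, x, label, eps=0.5)

# AR view: a single worst-case perturbation inside the same ball flips the
# prediction, so the input is not adversarially robust at eps = 0.5.
worst_case = [xi - 0.5 for xi in x]
ar_holds = toy_predict(worst_case) == label

# With a small enough radius, no perturbation can cross the boundary.
pr_small_eps = estimate_pr(toy_predict, x, label, eps=0.1)

print(pr_large_eps, ar_holds, pr_small_eps)
```

In this toy setting the same input can have high PR while an adversarial example still exists, which is the sense in which PR complements rather than replaces AR.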
Peer Reviews
Decision · Submitted to ICLR 2026
This paper provides a comprehensive and systematic analysis of probabilistic robustness (PR) across a wide variety of datasets and model architectures. Specifically, the benchmark includes diverse datasets such as CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny-ImageNet, and evaluates a broad set of models including DeiT-S, DeiT-T, ResNet-18, ResNet-34, Simple-CNN, VGG-19, ViT-B, ViT-S, and WRN-28-10. The paper further makes a novel contribution by introducing PRBench, the first dedicated framework
(Minor issue only) The paper defines GE as “the difference between the natural and empirical risk.” I suggest including its mathematical formulation, similar to Equation (4), for clarity and consistency. While abbreviations such as AT and GE are widely used, the paper employs too many acronyms (e.g., AT, PR, GE, RT, …), which can hinder readability. I recommend adding a table summarizing all abbreviations for convenience. Several prior works are closely related to this paper in terms of GE an
- The distinction and relationship between AR and PR are of great importance to the community. The claim that standard AT is a superior method for achieving both AR and PR is a convincing finding.
- I appreciate the extensive workload and great efforts in building this benchmark. I believe it would have a significant influence on this field and be very useful for the following studies.
- The paper is well-written, also providing a solid theoretical analysis to explain the observed differences
- I'm a little confused by the title and conclusion that PR comes "for free". The results clearly show that this "free" PR is paid for with a significant drop in clean accuracy. So isn't this a complex, multi-objective trade-off, instead of a "free lunch"?
- The PR-targeted methods evaluated are relatively simple. The finding is more accurately "strong AT is better than simple RT," which is less surprising. Further including more advanced PR-targeted techniques might be an effective refinement.
1. The author conducts extensive experiments to demonstrate that AT outperforms PR-targeted training in adversarial and probabilistic robustness but results in a higher generalization error. The evidence derived from various model architectures and standard datasets is consistent and compelling.
2. The author comprehensively illustrates the multi-fold trade-off among (PR, AR), GE, and efficiency, which may guide future research on how to handle them appropriately.
1. The descriptions of the relationship among p, f, and L are confusing. For instance, in Eq. 1, p is a softmax function outside the model f, but in Thm. 1, the projection p is an inner composition of the model f. Besides, the author seems to mistakenly use the model f as the loss function L in the proofs in Appendix F.5.
2. The author overlooks a critical assumption for Lemma 2: p_i>0 must hold when l_i=1. If l_i=1 and p_i=0, the gradient of the cross-entropy loss would be infi
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
