Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks
David Stutz, Matthias Hein, Bernt Schiele

TL;DR
This paper introduces confidence-calibrated adversarial training (CCAT), which improves robustness against unseen attack types by biasing models towards low-confidence predictions on adversarial examples and rejecting examples whose confidence is low.
Contribution
The authors propose CCAT, a novel adversarial training method that enhances generalization to unseen attacks by incorporating confidence calibration and rejection mechanisms.
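To make "biasing the model towards low confidence" concrete, here is a minimal sketch of one way a training target for an adversarial example could be built: the one-hot label is blended with the uniform distribution, and the blend moves towards uniform as the perturbation grows. The function name `soft_target`, the power-law transition, and the parameters `epsilon` and `rho` are assumptions for illustration; the summary above does not specify them.

```python
def soft_target(label, num_classes, delta_norm, epsilon, rho=10.0):
    """Blend a one-hot target with the uniform distribution.

    Hypothetical helper: lam is 1 for an unperturbed input (pure one-hot)
    and decays to 0 as the perturbation norm reaches epsilon (pure uniform).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Assumed transition: power-law decay in the relative perturbation size.
    lam = (1.0 - min(1.0, delta_norm / epsilon)) ** rho
    return [
        lam * (1.0 if k == label else 0.0) + (1.0 - lam) / num_classes
        for k in range(num_classes)
    ]


if __name__ == "__main__":
    # A clean input keeps its one-hot label; a maximally perturbed one
    # is assigned the uniform distribution.
    print(soft_target(2, 4, delta_norm=0.0, epsilon=0.03))
    print(soft_target(2, 4, delta_norm=0.03, epsilon=0.03))
```

Training with a cross-entropy loss against such targets would push the predicted confidence down on perturbed inputs, which is what lets a confidence threshold separate them from clean inputs at test time.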
Findings
CCAT improves robustness against attack types not seen during training, including larger perturbations and other norms.
CCAT achieves higher clean accuracy than standard adversarial training.
The authors developed new white- and black-box attacks to evaluate confidence-based defenses.
Abstract
Adversarial training yields robust models against a specific threat model, e.g., L∞ adversarial examples. Typically robustness does not generalize to previously unseen threat models, e.g., other Lp norms, or larger perturbations. Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low confidence predictions on adversarial examples. By allowing to reject examples with low confidence, robustness generalizes beyond the threat model employed during training. CCAT, trained only on L∞ adversarial examples, increases robustness against larger L∞, L2, L1 and L0 attacks, adversarial frames, distal adversarial examples and corrupted examples and yields better clean accuracy compared to adversarial training. For thorough evaluation we developed novel white- and black-box attacks directly attacking CCAT by…
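The rejection step described in the abstract can be sketched as a simple confidence threshold at test time. This is a minimal illustration, not the authors' code: the helper names `predict_or_reject` and `error_on_accepted`, and the way the threshold is passed in, are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, threshold):
    """Return the predicted class index, or None if confidence is too low."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None  # rejected: the model abstains on this input
    return probs.index(confidence)


def error_on_accepted(batch_logits, labels, threshold):
    """Fraction of misclassified inputs among those that were not rejected."""
    accepted = 0
    wrong = 0
    for logits, label in zip(batch_logits, labels):
        pred = predict_or_reject(logits, threshold)
        if pred is None:
            continue
        accepted += 1
        wrong += int(pred != label)
    return wrong / accepted if accepted else 0.0


if __name__ == "__main__":
    logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.1, 0.0, 0.0]]
    labels = [0, 0, 1]
    print(error_on_accepted(logits, labels, threshold=0.9))
```

In practice the threshold would be fixed on held-out clean data, for example so that only a small fraction of correctly classified clean inputs is rejected; that choice is also an assumption here.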
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications
