TL;DR
This paper introduces Self-Calibrated Consistency (SCC), a test-time defense method that enhances the adversarial robustness of vision-language models like CLIP by leveraging semantic and spatial consistency during inference.
Contribution
The paper proposes SCC, a novel plug-and-play inference strategy that improves zero-shot adversarial robustness of CLIP without requiring additional training or labeled data.
Findings
SCC significantly boosts CLIP's robustness across 22 benchmarks.
SCC maintains high accuracy while defending against diverse adversarial attacks.
The method can be integrated with other vision-language models for improved robustness.
Abstract
Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks -- lack of semantic guidance and vulnerability to view variations -- collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the…
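To make the "soft pseudo-labels from ... multi-view predictions" idea more concrete, here is a minimal sketch of one way such labels could be formed in a CLIP-style zero-shot classifier: compute a class distribution for each augmented view, then average the distributions. It is an illustration only, not the authors' implementation. The function name `soft_pseudo_label`, the `temperature` value, and the toy 2-D embeddings are assumptions, and the counterattack warm-up mentioned in the abstract is omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_pseudo_label(view_embeddings, text_embeddings, temperature=0.01):
    """Average per-view zero-shot class distributions into one soft label.

    view_embeddings: image embeddings, one per augmented view (hypothetical input).
    text_embeddings: class-prompt embeddings, one per class (hypothetical input).
    temperature: assumed softmax temperature, not a value taken from the paper.
    """
    per_view = []
    for image in view_embeddings:
        logits = [cosine(image, text) for text in text_embeddings]
        per_view.append(softmax(logits, temperature))
    n_views = len(per_view)
    n_classes = len(text_embeddings)
    return [sum(p[k] for p in per_view) / n_views for k in range(n_classes)]


# Toy usage: two views agree on class 0, one view disagrees.
views = [[1.0, 0.1], [0.9, 0.2], [0.2, 1.0]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
label = soft_pseudo_label(views, prompts)
```

Averaging keeps the disagreement between views visible in the label (a soft distribution rather than a hard argmax), which is the property a consistency regularizer would rely on.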
Peer Reviews
Decision · Submitted to ICLR 2026
The paper crisply diagnoses why test-time counterattack-style defenses fail (semantic drift toward hard negatives and view sensitivity) and designs self-calibrated consistency to counter those issues directly. The proposed method is training-free and plug-and-play. In the experiments, robustness gains are large and consistent across natural-image and medical datasets, while clean accuracy is essentially unchanged. Ablations and sensitivity plots are informativ
Experiments focus on zero-shot classification, while the claims about broader VLM use (retrieval, open-vocab detection/segmentation) are mentioned rather than demonstrated. I do suggest the authors explore other VLM applications in addition to classification. Note that the hyperparameters are tuned by grid search, while practical guidance beyond the reported defaults is limited. Attack coverage is narrow: results are mostly under PGD-10 and a small CW budget. There’s no AutoAttack, Square, or t
1. The motivation is well illustrated with empirical analyses across diverse datasets (see Figure 1). 2. Experiments across different datasets show the efficacy of the proposed method in terms of clean and robust accuracy. 3. Theoretical analyses are given to justify the effectiveness of the proposed semantic consistency method. The proofs seem to be correct.
1. The paper is not well organized. The authors introduce the concept of counterattack in the Introduction (the first section), yet the formal definition of test-time counterattack is not given until Section 3.1. I also find the definition of $\hat{z}$ hard to understand: is it a logit? 2. The authors mention that the paper evaluates adaptive attacks, but little information is given about them. What if the adversarial attackers know the original images instead of the counter-at
1. The authors evaluate their approach on multiple benchmarks and conduct extensive ablation studies. 2. The experimental setup is described in a detailed and thorough manner.
1. Lack of Evaluation Against Adaptive Attacks: The current experiments rely on standard white-box attacks (PGD, CW) that are unaware of the SCC defense mechanism. A strong adversary could potentially design an adaptive attack that incorporates the entire SCC process (including the pseudo-label generation and multi-view optimization) into its own loss function to bypass the defense. 2. Although the paper makes a significant contribution to improving the adversarial robustness of CLIP and relate
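For context on the PGD setting the reviews refer to (PGD-10 means ten iterations), the following is a minimal sketch of an L-infinity projected gradient descent attack on a toy linear softmax classifier. It is not the paper's attack configuration. The weights `W`, the budget `eps`, the step size `step`, and the helper names are illustrative assumptions, and the usual clipping to a valid pixel range is left out for brevity.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W, x):
    """Logits of a linear classifier: one dot product per class row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def loss_grad(W, x, label):
    """Gradient of the cross-entropy loss w.r.t. x, i.e. W^T (p - onehot)."""
    p = softmax(logits(W, x))
    err = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(W, x, label, eps=0.1, step=0.025, iters=10):
    """Untargeted L-inf PGD: ascend the loss, then project back into the eps-ball."""
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad(W, x_adv, label)
        # Signed gradient ascent step on the loss.
        x_adv = [xa + step * sign(gj) for xa, gj in zip(x_adv, g)]
        # Projection onto the L-inf ball of radius eps around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Toy usage: a 2-class identity classifier and an input close to the boundary.
W = [[1.0, 0.0], [0.0, 1.0]]
x_clean = [0.55, 0.45]
x_adv = pgd_linf(W, x_clean, label=0)
```

An adaptive attack of the kind the reviewers ask about would replace `loss_grad` with the gradient of a loss computed through the whole defense pipeline, rather than through the undefended classifier alone.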
