TL;DR
This paper introduces ET3, a training-free, energy-minimizing test-time transformation that significantly improves adversarial robustness of large vision-language models across various tasks.
Contribution
Proposes ET3, a theoretically grounded, lightweight defense method that enhances robustness of LVLMs without additional training.
Findings
ET3 provides a strong defense for classifiers under adversarial attack.
ET3 enhances zero-shot classification with CLIP.
ET3 boosts robustness in tasks like image captioning and VQA.
Abstract
Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for…
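To make the idea of "minimizing the energy of the input samples" concrete, here is a minimal sketch on a toy linear classifier. It is not the paper's algorithm: the abstract does not define the energy or the update rule, so the sketch assumes the common energy-based-model definition E(x) = -logsumexp(logits(x)), signed gradient steps, and an L-infinity bound around the original input. The names `energy_descent`, `eps`, `step`, and `n_steps` are hypothetical.

```python
import numpy as np

def logsumexp(z):
    # Numerically stable log(sum(exp(z))).
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def energy(W, b, x):
    # Assumed energy: negative log-sum-exp of the classifier logits.
    return -logsumexp(W @ x + b)

def energy_grad(W, b, x):
    # Analytic gradient of the assumed energy for a linear model:
    # dE/dx = -W^T softmax(Wx + b).
    z = W @ x + b
    p = np.exp(z - logsumexp(z))
    return -W.T @ p

def energy_descent(W, b, x, eps=0.1, step=0.02, n_steps=10):
    # Hypothetical test-time transformation: move the input downhill in
    # energy while staying inside an L-infinity ball of radius eps.
    x0 = x.copy()
    x_t = x.copy()
    for _ in range(n_steps):
        x_t = x_t - step * np.sign(energy_grad(W, b, x_t))
        x_t = np.clip(x_t, x0 - eps, x0 + eps)
    return x_t

# Toy data: 3 classes, 5 input features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)

x_new = energy_descent(W, b, x)
print("energy before:", energy(W, b, x))
print("energy after: ", energy(W, b, x_new))
```

The transformed input `x_new` would then be passed to the classifier in place of `x`. For a deep model the gradient would come from automatic differentiation instead of the closed form used here.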
