Hierarchically Robust Zero-shot Vision-language Models
Junhao Dong, Yifei Zhang, Hao Zhu, Yew-Soon Ong, Piotr Koniusz

TL;DR
This paper introduces a hierarchical adversarial fine-tuning framework for vision-language models, improving robustness against adversarial attacks while leveraging class hierarchy structures.
Contribution
It proposes a novel hierarchical embedding alignment method with mechanisms to control embedding depth, enhancing robustness and semantic diversity in vision-language models.
Findings
The method improves adversarial robustness across multiple datasets.
Theoretical link established between embedding depth and margin size.
Aligning over multiple hierarchies boosts semantic variety.
Abstract
Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the…
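To make the leaf-versus-superclass distinction in the abstract concrete, here is a minimal sketch of zero-shot classification scored at two levels of a class hierarchy. It is not the paper's method: the toy hierarchy, the three-dimensional embeddings, the temperature value, and the function names are all assumptions chosen for illustration, and superclass scores are obtained simply by summing leaf probabilities.

```python
import math

# Hypothetical toy hierarchy (assumption): leaf class -> superclass (parent).
PARENT = {"cat": "mammal", "dog": "mammal", "sparrow": "bird"}

# Hypothetical text embeddings (assumption); a real VLM would produce these
# from prompts such as "a photo of a cat".
TEXT_EMBS = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.3, 0.0],
    "sparrow": [0.0, 0.2, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax; the temperature value is an assumption."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot(image_emb, text_embs, parent):
    """Return (leaf probabilities, superclass probabilities)."""
    names = list(text_embs)
    probs = softmax([cosine(image_emb, text_embs[n]) for n in names])
    leaf_probs = dict(zip(names, probs))
    # Aggregate leaf probabilities up to each parent class.
    super_probs = {}
    for name, p in leaf_probs.items():
        super_probs[parent[name]] = super_probs.get(parent[name], 0.0) + p
    return leaf_probs, super_probs


# Hypothetical image embedding lying between "cat" and "dog".
leaf_probs, super_probs = zero_shot([0.95, 0.25, 0.05], TEXT_EMBS, PARENT)
leaf_pred = max(leaf_probs, key=leaf_probs.get)
super_pred = max(super_probs, key=super_probs.get)
print(leaf_pred, super_pred)
```

In this toy setting, a small perturbation that flips the leaf prediction between cat and dog leaves the superclass prediction unchanged, whereas an attack on the superclass has to push the image embedding toward a different branch of the hierarchy. Evaluating both levels separately is one simple way to observe the kind of superclass-level degradation the abstract refers to.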
