Hierarchically Robust Zero-shot Vision-language Models

Junhao Dong; Yifei Zhang; Hao Zhu; Yew-Soon Ong; Piotr Koniusz

arXiv:2604.18867·cs.CV·April 22, 2026

TL;DR

This paper introduces a hierarchical adversarial fine-tuning framework for vision-language models, improving robustness against adversarial attacks while leveraging class hierarchy structures.

Contribution

It proposes a novel hierarchical embedding alignment method with mechanisms to control embedding depth, enhancing robustness and semantic diversity in vision-language models.

Findings

01. Improved adversarial robustness demonstrated across multiple datasets.

02. Theoretical link established between embedding depth and margin size.

03. Aligning over multiple hierarchies boosts semantic variety.

Abstract

Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the…
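
The zero-shot setting and the two attack targets described in the abstract (leaf classes versus superclasses) can be illustrated with a small sketch. This is not the paper's method: the three-dimensional embeddings, the class names beyond "cat" and "mammal", the `PARENT` mapping, and the `zero_shot_predict` helper are all hypothetical, and the "perturbed" embeddings are hand-picked rather than produced by a real attack. It only shows how nearest-text-embedding classification works and why a leaf-level flip and a superclass-level flip are different failure modes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy text embeddings for leaf classes (assumption, not from the paper).
TEXT_EMBEDDINGS = {
    "cat": [1.0, 0.2, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.1, 1.0],
}

# Hypothetical one-level class hierarchy: leaf class -> superclass.
PARENT = {"cat": "mammal", "dog": "mammal", "car": "vehicle"}

def zero_shot_predict(image_embedding):
    """Pick the leaf class whose text embedding is most similar to the image
    embedding, then look up its superclass in the hierarchy."""
    leaf = max(
        TEXT_EMBEDDINGS,
        key=lambda name: cosine(image_embedding, TEXT_EMBEDDINGS[name]),
    )
    return leaf, PARENT[leaf]

# Hand-picked image embeddings standing in for a clean input and two
# perturbed inputs (assumption: a real attack would compute these).
clean = [0.9, 0.25, 0.05]
leaf_flip = [0.7, 0.6, 0.05]        # changes the leaf, keeps the superclass
superclass_flip = [0.1, 0.1, 0.9]   # changes the superclass as well

for name, emb in [("clean", clean), ("leaf_flip", leaf_flip),
                  ("superclass_flip", superclass_flip)]:
    print(name, zero_shot_predict(emb))
```

In this toy setup the first perturbed embedding moves the prediction from one mammal to another, while the second crosses into a different superclass, which is the more severe error the abstract singles out.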
