ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

Ziteng Yang; Jingzehua Xu; Yanshu Li; Zepeng Li; Yeqiang Wang; Xinghui Li

arXiv:2505.17692 · cs.CV · October 7, 2025

TL;DR

ViP$^2$-CLIP introduces a visual-perception prompting mechanism that adaptively generates fine-grained textual prompts from visual context, significantly improving zero-shot anomaly detection across diverse industrial and medical benchmarks.

Contribution

The paper proposes a novel visual-perception prompting method that eliminates manual templates and class-name priors, enhancing zero-shot anomaly detection performance and generalization.

Findings

01. Achieves state-of-the-art results on 15 benchmarks.

02. Demonstrates robust cross-domain generalization.

03. Effectively focuses on precise abnormal regions.

Abstract

Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target-domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and offer limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types and thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local…
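For orientation, the generic scoring step that CLIP-based ZSAD methods commonly build on can be sketched as follows: visual features are compared against one "normal" and one "abnormal" text embedding, and a softmax over the two cosine similarities gives an anomaly probability. This is a minimal illustration under stated assumptions, not the ViP-Prompt mechanism itself. The function name `anomaly_probability`, the temperature of 0.07, the embedding size, and the random stand-in features are all hypothetical; a real pipeline would take these embeddings from CLIP's image and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def anomaly_probability(visual_feats, normal_text, abnormal_text, temperature=0.07):
    """Return P(abnormal) for each visual feature vector.

    visual_feats: (N, D) array of patch or image embeddings (assumed input).
    normal_text, abnormal_text: (D,) text embeddings of the two prompts.
    temperature: softmax temperature (hypothetical value for this sketch).
    """
    v = l2_normalize(np.asarray(visual_feats, dtype=np.float64))
    t = l2_normalize(np.stack([normal_text, abnormal_text]).astype(np.float64))
    logits = v @ t.T / temperature            # (N, 2) scaled cosine similarities
    logits -= logits.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, 1]                        # column 1 is the "abnormal" prompt


# Random stand-ins for encoder outputs (assumption: no real CLIP model is loaded).
rng = np.random.default_rng(0)
dim = 8
normal_text = rng.normal(size=dim)
abnormal_text = rng.normal(size=dim)

# A 4x4 grid of patch features; patch 5 is made to match the abnormal prompt exactly.
patches = rng.normal(size=(16, dim))
patches[5] = abnormal_text

# Pixel-level output: one anomaly probability per patch, reshaped to the grid.
anomaly_map = anomaly_probability(patches, normal_text, abnormal_text).reshape(4, 4)

# Image-level output: the same comparison applied to a single global feature.
image_feat = rng.normal(size=(1, dim))
image_score = anomaly_probability(image_feat, normal_text, abnormal_text)[0]
```

In this framing, the prompting strategies discussed in the abstract differ in how the two text embeddings are produced; the comparison step stays the same.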
