ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li

TL;DR
ViP$^2$-CLIP introduces a visual-perception prompting mechanism that adaptively generates fine-grained textual prompts from visual context, significantly improving zero-shot anomaly detection across diverse industrial and medical benchmarks.
Contribution
The paper proposes a novel visual-perception prompting method that eliminates manual templates and class-name priors, enhancing zero-shot anomaly detection performance and generalization.
Findings
Achieves state-of-the-art results on 15 benchmarks.
Demonstrates robust cross-domain generalization.
Effectively focuses on precise abnormal regions.
Abstract
Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types and thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^2$-CLIP. The key insight of ViP$^2$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local…
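For readers unfamiliar with prompt-based ZSAD, the following is a minimal sketch of the generic scoring step that CLIP-based methods commonly build on: a visual feature is compared against a "normal" and an "abnormal" text embedding, and a softmax over the two similarities gives an anomaly probability. It is not the paper's ViP-Prompt mechanism. The function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions made for illustration; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_score(feature, normal_text, abnormal_text, temperature=0.07):
    """Probability that `feature` matches the abnormal prompt.

    Two-way softmax over temperature-scaled cosine similarities.
    The temperature value is an assumed default, not taken from the paper.
    """
    s_normal = cosine(feature, normal_text) / temperature
    s_abnormal = cosine(feature, abnormal_text) / temperature
    # Subtract the max before exponentiating for numerical stability.
    m = max(s_normal, s_abnormal)
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)


def anomaly_map(patch_features, normal_text, abnormal_text):
    """Apply the same score to every patch feature in a 2-D grid."""
    return [
        [anomaly_score(p, normal_text, abnormal_text) for p in row]
        for row in patch_features
    ]


# Toy, hand-made embeddings (hypothetical; real ones come from CLIP encoders).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [[[1.0, 0.1], [0.2, 1.0]]]  # 1 x 2 grid of patch features
scores = anomaly_map(patches, normal_text, abnormal_text)
```

Applying the score to the global image feature gives an image-level detection score, while applying it per patch gives a coarse segmentation map. With static prompts the two text embeddings are the same for every image, which is the limitation the abstract describes.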
