VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada

TL;DR
VLMDiff introduces an unsupervised multi-class anomaly detection framework that combines vision-language models with diffusion techniques, improving detection accuracy without class-specific training.
Contribution
It leverages pre-trained vision-language models to extract captions for normal images, conditioning a latent diffusion model for scalable, multi-class anomaly detection without manual annotations.
Findings
Improves pixel-level PRO metric by up to 25 points on Real-IAD
Outperforms state-of-the-art diffusion-based methods
Achieves robust multi-class anomaly detection without class-specific training
Abstract
Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce \ours, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. \ours, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAnomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
