One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection
Bin-Bin Gao, Chengjie Wang

TL;DR
This paper introduces a simple, universal framework for visual anomaly detection that eliminates the need for language encoders and complex adaptation, achieving state-of-the-art results across diverse benchmarks.
Contribution
It proposes a decoupled, parameter-efficient approach that removes reliance on language models for anomaly detection, simplifying and improving the process.
Findings
Outperforms state-of-the-art zero-/few-shot methods
Surpasses full-shot anomaly detection performance
Highly parameter-efficient with only 0.002M learnable parameters
Abstract
Universal visual anomaly detection (AD) aims to identify anomalous images and segment anomalous regions in open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. Recent approaches have made significant progress through the wide use of visual-language foundation models. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then…
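The abstract's observation that the language encoder only supplies decision weights can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes, hypothetically, that each task holds one "normal" and one "anomalous" weight vector, that scores are a softmax over cosine similarities, and that "decoupled" means separate weights for image-level classification and patch-level segmentation. The feature dimension `D`, the temperature, and all function names are illustrative, and random vectors stand in for features from a frozen vision encoder.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def anomaly_probability(feature, weights, temperature=0.07):
    """Softmax over cosine similarities to [normal, anomalous] weight vectors.

    `weights` plays the role that text embeddings play in a vision-language
    model; here they are plain vectors with no language encoder behind them.
    """
    f = l2_normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, l2_normalize(w))) / temperature
        for w in weights
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)  # probability of the "anomalous" class


def score_image(global_feature, patch_features, cls_weights, seg_weights):
    """Return an image-level score and a per-patch anomaly map.

    Assumption: classification and segmentation use separate weight pairs.
    """
    image_score = anomaly_probability(global_feature, cls_weights)
    anomaly_map = [anomaly_probability(p, seg_weights) for p in patch_features]
    return image_score, anomaly_map


random.seed(0)
D = 512  # hypothetical feature dimension


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(D)]


# Stand-ins for weights that would be learned and features that would come
# from a frozen vision encoder.
cls_weights = [rand_vec(), rand_vec()]  # [normal, anomalous] for images
seg_weights = [rand_vec(), rand_vec()]  # [normal, anomalous] for patches
global_feature = rand_vec()
patch_features = [rand_vec() for _ in range(16)]  # e.g. a 4x4 patch grid

image_score, anomaly_map = score_image(
    global_feature, patch_features, cls_weights, seg_weights
)
num_learnable = sum(len(w) for w in cls_weights + seg_weights)
```

In this sketch the only trainable values are the four weight vectors, 2 × 2 × `D` numbers in total, which is why a design of this kind can stay very small. Whether this matches the paper's exact parameterization is not stated in the truncated abstract.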
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
