Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang

TL;DR
This paper introduces a multi-level fine-grained semantic captioning approach to improve anomaly localization in few-shot settings by better aligning textual descriptions with visual regions.
Contribution
It proposes Multi-Level Fine-Grained Semantic Caption (MFSC), a method for automatically constructing multi-level textual descriptions, and the FineGrainedAD framework, whose MLLP and MLSA components use those descriptions to improve anomaly detection.
Findings
Achieves superior performance in few-shot settings on the MVTec-AD and VisA datasets.
Improves semantic alignment between text and visual regions.
Enhances localization accuracy in few-shot anomaly detection.
Abstract
Few-shot anomaly detection (FSAD) methods identify anomalous regions given only a few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, lacking detailed textual descriptions, these methods can only match pre-defined image-level descriptions against each visual patch token to identify potentially anomalous regions. This causes semantic misalignment between image-level descriptions and patch-level visual anomalies and results in sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on the…
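The baseline that the abstract criticizes, matching one pair of image-level descriptions against every patch token, can be illustrated with a minimal sketch. It is not the paper's implementation. The function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions chosen for illustration. A real system would take patch and text embeddings from a pre-trained VLM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def patch_anomaly_scores(patch_feats, normal_text, abnormal_text, temperature=0.07):
    """Score each patch by a softmax over its similarity to two text embeddings.

    Hypothetical helper: the same image-level "normal" and "abnormal"
    embeddings are reused for every patch, which is the coarse matching
    the abstract describes. The temperature value is an assumption.
    """
    scores = []
    for patch in patch_feats:
        s_normal = cosine(patch, normal_text) / temperature
        s_abnormal = cosine(patch, abnormal_text) / temperature
        # Subtract the max logit before exponentiating for numerical stability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores

# Toy embeddings standing in for VLM outputs (assumed values).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

scores = patch_anomaly_scores(patches, normal_text, abnormal_text)
for i, score in enumerate(scores):
    print(f"patch {i}: anomaly score {score:.3f}")
```

Because every patch is compared against the same two image-level embeddings, the per-patch score map can only be as specific as those descriptions, which is the misalignment the abstract attributes to existing methods.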
