TL;DR
This paper introduces ACD-CLIP, a framework for zero-shot anomaly detection that jointly refines feature representation and cross-modal fusion, and reports superior accuracy and robustness on industrial and medical benchmarks.
Contribution
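It proposes a co-designed architecture that targets two limitations of pre-trained vision-language models: a Conv-LoRA adapter injects the local inductive biases that dense prediction requires, and a Dynamic Fusion Gateway replaces inflexible feature fusion with fusion that adapts text prompts to the visual context.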
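To make the Conv-LoRA idea concrete, here is a minimal NumPy sketch of one plausible reading: a standard low-rank update whose rank-r bottleneck is reshaped to the patch grid and mixed with a small depthwise 3x3 convolution. The function names, the residual placement of the convolution, and the kernel size are all assumptions for illustration; the summary does not specify the paper's actual layer design.

```python
import numpy as np

def depthwise_conv3x3(grid, kernels):
    """Zero-padded, stride-1 depthwise 3x3 cross-correlation.
    grid: (H, W, r) feature map; kernels: (3, 3, r), one filter per channel."""
    H, W, _ = grid.shape
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W, :] * kernels[dy, dx, :]
    return out

def conv_lora_forward(tokens, W_frozen, A, B, kernels, grid_hw, scale=1.0):
    """Hypothetical Conv-LoRA layer (an assumption, not the paper's exact design).
    tokens: (H*W, d_in) patch tokens; W_frozen: (d_in, d_out) frozen weight;
    A: (d_in, r) down-projection; B: (r, d_out) up-projection."""
    H, W = grid_hw
    base = tokens @ W_frozen                     # frozen pre-trained path
    low = (tokens @ A).reshape(H, W, -1)         # rank-r features on the patch grid
    low = low + depthwise_conv3x3(low, kernels)  # local mixing in the bottleneck
    update = low.reshape(H * W, -1) @ B          # project back to d_out
    return base + scale * update

# Usage: with B initialised to zero, the adapted layer matches the frozen one.
rng = np.random.default_rng(0)
H, W, d, r = 4, 4, 8, 2
tokens = rng.normal(size=(H * W, d))
W_frozen = rng.normal(size=(d, d))
A = rng.normal(size=(d, r))
kernels = rng.normal(size=(3, 3, r))
out = conv_lora_forward(tokens, W_frozen, A, np.zeros((r, d)), kernels, (H, W))
```

Only A, B, and the kernels would be trained in this sketch, which is what keeps such an adapter parameter-efficient. Setting the kernels to zero recovers plain LoRA.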
Findings
Achieves superior accuracy on diverse benchmarks.
Demonstrates robustness in industrial and medical anomaly detection.
Validates the importance of joint feature refinement and dynamic fusion.
Abstract
Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation…
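The abstract describes the Dynamic Fusion Gateway only at a high level, so the following NumPy sketch is an illustrative assumption rather than the authors' implementation. It pools the patch features into a visual context vector, uses a sigmoid gate to shift the "normal" and "anomalous" prompt embeddings, and then scores each patch against the modulated prompts. The names `dynamic_fusion`, `W_gate`, and `W_shift`, the mean pooling, and the temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_fusion(patches, text, W_gate, W_shift, temperature=0.07):
    """Hypothetical visually conditioned prompt modulation.
    patches: (N, d) patch features; text: (2, d), rows = [normal, anomalous].
    W_gate, W_shift: (d, d) assumed learned projections.
    Returns a per-patch anomaly probability of shape (N,)."""
    context = patches.mean(axis=0)        # global visual context, (d,)
    gate = sigmoid(context @ W_gate)      # per-dimension gate in (0, 1)
    shift = context @ W_shift             # visual offset for the prompts
    text_mod = text + gate * shift        # vision -> text modulation
    # text -> vision: cosine similarity of each patch to each modulated prompt
    sims = l2_normalize(patches) @ l2_normalize(text_mod).T / temperature
    sims -= sims.max(axis=1, keepdims=True)   # stabilise the softmax
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    return probs[:, 1]

# Usage: a 3x3 patch grid with 4-dimensional features.
rng = np.random.default_rng(0)
patches = rng.normal(size=(9, 4))
text = rng.normal(size=(2, 4))
W_gate = rng.normal(size=(4, 4))
W_shift = rng.normal(size=(4, 4))
anomaly_map = dynamic_fusion(patches, text, W_gate, W_shift).reshape(3, 3)
```

With `W_shift` set to zero the prompts are left unchanged and the sketch falls back to static prompt matching, the kind of inflexible fusion the abstract argues against.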
