VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection
Yuxin Jiang, Yunkang Cao, Yuqi Cheng, Yiheng Zhang, Weiming Shen

TL;DR
VTFusion introduces a multimodal fusion network that leverages domain-specific vision-text features and a dedicated fusion module to improve few-shot anomaly detection in industrial settings.
Contribution
The paper presents a novel vision-text fusion framework with adaptive feature extractors and a specialized fusion module tailored for industrial anomaly detection.
Findings
Achieves 96.8% AUROC on the MVTec AD dataset in the 2-shot setting
Attains 86.2% AUROC on the VisA dataset
Reaches 93.5% AUPRO on an industrial automotive parts dataset
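For readers unfamiliar with the AUROC metric reported above, the following is a minimal sketch of how it can be computed from anomaly scores and ground-truth labels. It is not the authors' evaluation code: the function name `auroc` and the toy scores are hypothetical, and AUPRO (a region-overlap metric) is not covered here.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen anomalous sample
    (label 1) receives a higher score than a randomly chosen normal
    sample (label 0); ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for four test images (illustration only).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]  # 0 = normal, 1 = anomalous
print(auroc(scores, labels))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a sort-based implementation would be the usual choice on full benchmarks.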
Abstract
Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations,…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
