ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection
Qiuhui Chen, Jiaxiang Song, Shuai Tan, and Weimin Zhong

TL;DR
ZSG-IAD is a multimodal framework that enables zero-shot industrial anomaly detection with grounded, interpretable results using vision-language features and structured reasoning.
Contribution
ZSG-IAD introduces a multimodal zero-shot anomaly detection framework that pairs a language-guided two-hop grounding module with Executable-Rule GRPO and verifiable rewards, aiming at reliable, explainable industrial defect detection.
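To make the idea of verifiable rewards concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a hypothetical JSON report schema (`reasoning`, `anomalous`), a hypothetical rule that the report's verdict must agree with the predicted mask, and the standard group-relative reward normalization used in GRPO. The reward values (0.5 per rule) and the threshold are illustrative choices, not values from the paper.

```python
import json
import numpy as np

# Hypothetical report schema; the paper's actual report fields are not given here.
REQUIRED_KEYS = {"reasoning", "anomalous"}

def rule_reward(report_text, mask, thresh=0.5):
    """Score one sampled output with two executable checks (illustrative weights)."""
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return 0.0  # output is not parseable, so no reward
    if not isinstance(report, dict) or not REQUIRED_KEYS.issubset(report):
        return 0.0  # required fields are missing
    reward = 0.5  # structured-output check passed
    mask_has_anomaly = bool((mask > thresh).any())
    if bool(report["anomalous"]) == mask_has_anomaly:
        reward += 0.5  # report verdict agrees with the predicted mask
    return reward

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of three sampled outputs for the same input.
mask = np.zeros((4, 4))
mask[1, 1] = 0.9  # one pixel flagged as anomalous
samples = [
    '{"reasoning": "scratch near edge", "anomalous": true}',
    '{"reasoning": "surface looks clean", "anomalous": false}',
    "not a structured report",
]
rewards = [rule_reward(s, mask) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group, the well-formed and mask-consistent sample receives the highest advantage, and the unparseable sample the lowest. Because every rule is an executable check, each reward can be recomputed and audited.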
Findings
Strong zero-shot performance across multiple benchmarks
Produces transparent, physically grounded anomaly masks
Outperforms prior methods in interpretability and accuracy
Abstract
Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion…
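The two-hop grounding described in the abstract can be sketched roughly as follows. This is a sketch under stated assumptions, not the paper's architecture: the tensor shapes, the soft (softmax) slot selection, the sigmoid gates, the projection `w_channel`, and the 1x1 decoder weights `w_dec` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_hop_grounding(sent_emb, slots, slot_attn, feat, w_channel, w_dec):
    """
    sent_emb:  (D,)      embedding of one anomaly-related sentence
    slots:     (K, D)    latent slots distilled from multimodal features
    slot_attn: (K, H, W) spatial attention map of each slot
    feat:      (C, H, W) fused feature map
    w_channel: (D, C)    hypothetical projection from slot space to channel gates
    w_dec:     (C,)      hypothetical 1x1 convolution acting as a lightweight decoder
    """
    d = slots.shape[1]
    # Hop 1: the sentence softly selects slots, giving coarse spatial support.
    slot_w = softmax(slots @ sent_emb / np.sqrt(d))       # (K,)
    coarse = np.tensordot(slot_w, slot_attn, axes=1)      # (H, W)
    # Hop 2: the selected slots gate the feature map by channel and by location.
    selected = slot_w @ slots                             # (D,)
    channel_gate = sigmoid(selected @ w_channel)          # (C,)
    spatial_gate = coarse / (coarse.max() + 1e-8)         # (H, W), scaled to [0, 1]
    gated = feat * channel_gate[:, None, None] * spatial_gate[None]
    # Lightweight decoder: per-pixel linear map over channels, then sigmoid.
    logits = np.tensordot(w_dec, gated, axes=1)           # (H, W)
    return coarse, sigmoid(logits)

# Toy inputs with random values, only to exercise the shapes.
rng = np.random.default_rng(0)
K, D, C, H, W = 4, 8, 16, 32, 32
slot_attn = softmax(rng.normal(size=(K, H * W)), axis=-1).reshape(K, H, W)
coarse, mask = two_hop_grounding(
    sent_emb=rng.normal(size=D),
    slots=rng.normal(size=(K, D)),
    slot_attn=slot_attn,
    feat=rng.normal(size=(C, H, W)),
    w_channel=rng.normal(size=(D, C)),
    w_dec=rng.normal(size=C),
)
```

The first hop yields a coarse map that is a convex combination of the slot attention maps. The second hop reuses that map as a spatial gate, so the fine-grained mask stays tied to the evidence selected by the sentence.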
