ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection

Qiuhui Chen, Jiaxiang Song, Shuai Tan, and Weimin Zhong

arXiv:2604.17949 · cs.CV · April 21, 2026

TL;DR

ZSG-IAD is a multimodal framework that enables zero-shot industrial anomaly detection with grounded, interpretable results using vision-language features and structured reasoning.

Contribution

The paper introduces a multimodal zero-shot anomaly detection framework that pairs a language-guided two-hop grounding module with Executable-Rule GRPO, a training step driven by verifiable rewards, to make industrial defect detection more reliable and explainable.

Findings

01 Strong zero-shot performance across multiple benchmarks

02 Produces transparent, physically grounded anomaly masks

03 Outperforms prior methods in interpretability and accuracy

Abstract

Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion…
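The two-hop grounding idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so every name and shape here (`two_hop_grounding`, `slot_attn`, `w_channel`, `w_decoder`, `top_k`) is a hypothetical stand-in, the slot selection is assumed to be a scaled dot-product followed by top-k, and the "lightweight decoder" is assumed to be a single 1x1 projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_hop_grounding(sentence_emb, slots, slot_attn, feat,
                      w_channel, w_decoder, top_k=2):
    """Illustrative two-hop grounding (all shapes are assumptions).

    sentence_emb: (D,)      embedding of one anomaly-related sentence
    slots:        (K, D)    latent slots distilled from multimodal features
    slot_attn:    (K, H, W) non-negative spatial support of each slot
    feat:         (C, H, W) fused feature map
    w_channel:    (D, C)    hypothetical projection from slots to channel gates
    w_decoder:    (C,)      hypothetical 1x1 decoder weights
    """
    # Hop 1: the sentence scores every slot; keep the top-k as "evidence".
    scores = slots @ sentence_emb / np.sqrt(slots.shape[1])   # (K,)
    weights = softmax(scores)
    top = np.argsort(weights)[-top_k:]
    sel_w = weights[top] / weights[top].sum()
    # Coarse spatial support: weighted sum of the selected slots' maps.
    coarse = np.tensordot(sel_w, slot_attn[top], axes=1)      # (H, W)

    # Hop 2: selected slots gate the feature map by channel and by location.
    slot_summary = sel_w @ slots[top]                         # (D,)
    channel_gate = sigmoid(slot_summary @ w_channel)          # (C,)
    spatial_gate = coarse / (coarse.max() + 1e-8)             # (H, W)
    gated = feat * channel_gate[:, None, None] * spatial_gate[None]
    # Lightweight decoder: 1x1 projection to per-pixel anomaly probability.
    logits = np.tensordot(w_decoder, gated, axes=1)           # (H, W)
    return coarse, sigmoid(logits)

# Usage with random placeholder tensors (no real data involved).
rng = np.random.default_rng(0)
K, D, C, H, W = 4, 8, 6, 5, 5
slot_attn = softmax(rng.normal(size=(K, H * W)), axis=0).reshape(K, H, W)
coarse, mask = two_hop_grounding(
    rng.normal(size=D), rng.normal(size=(K, D)), slot_attn,
    rng.normal(size=(C, H, W)), rng.normal(size=(D, C)), rng.normal(size=C),
)
```

In this sketch, `coarse` plays the role of the first hop's coarse spatial support and `mask` the second hop's fine-grained output; a trained model would learn the projections rather than sample them at random.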
