TL;DR
AD-Copilot is a multimodal large language model specialized for industrial anomaly detection. It relies on visual in-context comparison and a new large-scale dataset to outperform existing models and even surpass human experts on several tasks.
Contribution
The paper introduces AD-Copilot, a vision-language model equipped with a comparison encoder, together with Chat-AD, a large-scale industrial multimodal dataset. Together they advance industrial anomaly detection (IAD) performance beyond prior models.
Findings
Achieves 82.3% accuracy on the MMAD benchmark
Outperforms all models without data leakage
Surpasses human experts on several IAD tasks
Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs are trained mostly on general web data, which differs significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a…
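The abstract argues that comparing two images only after each has been encoded separately and described in language discards the fine visual detail that IAD depends on. As a rough illustration of the alternative, the sketch below compares patch features of a query image against a defect-free reference image directly in feature space. It is not AD-Copilot's comparison encoder, whose design the abstract does not detail. The function names, the use of cosine distance, and the assumption that both images are already represented as spatially aligned patch embeddings (for example, from a frozen vision encoder) are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero patch as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def patch_difference_map(query: List[Vector], reference: List[Vector]) -> List[float]:
    """Per-patch distance between aligned query and reference patch features.

    Hypothetical helper: assumes patch i of the query corresponds to
    patch i of the reference (same object, same pose).
    """
    if len(query) != len(reference):
        raise ValueError("query and reference must have the same number of patches")
    return [cosine_distance(q, r) for q, r in zip(query, reference)]


def most_anomalous_patch(diff_map: List[float]) -> int:
    """Index of the patch that deviates most from the reference."""
    return max(range(len(diff_map)), key=diff_map.__getitem__)


if __name__ == "__main__":
    # Toy 2-D "features" for three patches; only the last patch differs.
    reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
    diff = patch_difference_map(query, reference)
    print(diff)                        # [0.0, 0.0, 1.0]
    print(most_anomalous_patch(diff))  # 2
```

Because the difference is computed per patch before any text is generated, a small local change still produces a localized signal; a model that only compares two free-text descriptions has no such map to draw on.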
