VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

Huilin Deng; Hongchen Luo; Wei Zhai; Yang Cao; Yu Kang

arXiv:2409.20146 · cs.CV · April 2, 2026

TL;DR

This paper introduces VMAD, a novel visual-enhanced multimodal large language model framework for zero-shot industrial anomaly detection, improving fine-grained and open-ended anomaly analysis.

Contribution

The paper proposes VMAD, a framework that integrates visual knowledge and fine-grained perception into MLLMs for enhanced zero-shot anomaly detection.

Findings

01. VMAD outperforms state-of-the-art methods on multiple benchmarks.

02. The Defect-Sensitive Structure Learning scheme improves anomaly discrimination.

03. The Locality-enhanced Token Compression enhances fine-grained detection.

Abstract

Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects by establishing a feature mapping between textual prompts and inspection images, demonstrating excellent research value in flexible industrial manufacturing. However, existing ZSAD methods are limited by closed-world settings and struggle to handle unseen defects with predefined prompts. Recently, adapting Multimodal Large Language Models (MLLMs) for Industrial Anomaly Detection (IAD) has emerged as a viable solution. Unlike fixed-prompt methods, MLLMs follow a generative paradigm with open-ended text interpretation, enabling more adaptive anomaly analysis. However, this adaptation faces inherent challenges, as anomalies often manifest in fine-grained regions and exhibit minimal visual discrepancies from normal samples. To address these challenges, we propose a novel framework VMAD (Visual-enhanced MLLM…
