TL;DR
This paper introduces AD-FM, a multimodal large language model framework that improves anomaly detection through multi-stage reasoning and fine-grained reward optimization, yielding better domain adaptation and detection accuracy.
Contribution
The paper presents a novel multi-stage reasoning process and a fine-grained reward mechanism to better adapt multimodal LLMs for specialized anomaly detection tasks.
Findings
Significant accuracy improvements on industrial datasets.
Effective adaptation of general vision-language models to anomaly detection.
Enhanced supervision over reasoning processes leads to better detection of subtle defects.
Abstract
While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes, which encourages immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows.…
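The first limitation, poor data utilization under uniform responses, can be illustrated with a minimal sketch of the group-relative advantage as GRPO is commonly described: each sampled response's reward is standardized against the other responses in its group. The function name, the `eps` term, and the binary rewards below are illustrative assumptions; the abstract does not specify AD-FM's actual reward design or normalization.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Hypothetical helper: the commonly described GRPO normalization,
    not necessarily the exact form used by AD-FM.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every sampled response gets the same reward (e.g. all answer "normal"):
# each advantage is zero, so the group contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Diverse responses give non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

print(uniform)
print(mixed)
```

When all rewards in a group coincide, the numerator is zero for every response, so that training sample is effectively wasted. This is why the abstract calls diverse response patterns essential for GRPO optimization.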