EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO

Wei Guan; Jun Lan; Jian Cao; Hao Tan; Huijia Zhu; Weiqiang Wang

arXiv:2507.21619 · cs.CV · July 30, 2025

TL;DR

This paper introduces EMIT, a framework that adapts multimodal large language models (MLLMs) to industrial anomaly detection (IAD) by training them with difficulty-aware group relative policy optimization (GRPO) on a multi-task IAD dataset supplemented with GPT-generated object text descriptions.

Contribution

EMIT is described as the first framework to apply difficulty-aware group relative policy optimization (GRPO) to enhance MLLMs for industrial anomaly detection.

Findings

1. Achieves a 7.77% average performance improvement on the MMAD benchmark.

2. Uses GPT-generated object text descriptions to compensate for missing defective images.

3. Improves few-shot anomaly detection with a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons.

Abstract

Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by…
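
The abstract is cut off before it says how the difficulty-aware variant modifies GRPO, so the sketch below covers only the standard group-relative advantage that GRPO starts from: each sampled response is scored against the mean and standard deviation of rewards in its own group. The function names, the use of the population standard deviation, the `eps` term, and the `group_accuracy` difficulty proxy are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def group_accuracy(rewards, correct_threshold=1.0):
    """Hypothetical difficulty proxy: fraction of responses judged correct.

    This is NOT the paper's definition; it only illustrates the idea of a
    sample on which the model struggles to produce correct answers.
    """
    return sum(r >= correct_threshold for r in rewards) / len(rewards)


if __name__ == "__main__":
    mixed_group = [1.0, 0.0, 0.0, 1.0]  # two correct, two incorrect responses
    hard_group = [0.0, 0.0, 0.0, 0.0]   # every sampled response is incorrect

    print(group_relative_advantages(mixed_group))
    print(group_relative_advantages(hard_group))
    print(group_accuracy(mixed_group), group_accuracy(hard_group))
```

In this sketch the mixed group yields advantages of roughly +1 for correct responses and -1 for incorrect ones, while the all-incorrect group yields advantages of exactly zero. A group with identical rewards therefore contributes no learning signal under the plain formulation, which is one reason difficult samples can call for special handling.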
