# A unified vision-language model for cross-product defect detection in glove manufacturing

**Authors:** Yusen Zhao, Liang Tian, Yonggang Wang

PMC · DOI: 10.1371/journal.pone.0339867 · PLOS One · 2026-02-11

## TL;DR

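A single multimodal large language model, fine-tuned in two stages (SFT, then RFT), detects glove defects across product lines with accuracy comparable to a specialized YOLO detector (mAP 0.63 vs. 0.62), removing the need for one model per product.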

## Contribution

A unified MLLM-based defect detection framework with a two-stage fine-tuning strategy (SFT followed by RFT guided by a verifiable reward) that scales across product lines.

## Key findings

- The RFT-enhanced MLLM achieves 0.63 mAP, comparable to a highly specialized YOLO baseline (0.62 mAP).
- A single MLLM trained on mixed products maintains 0.61 mAP, showing cross-product adaptability.
- Natural language prompts enable dynamic handling of different defect types and products.
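
To illustrate the last finding, here is a minimal sketch of how one model could be steered toward a product and defect set through the prompt alone. The function name, template wording, requested output format, and example defect names are all hypothetical; the summary does not give the paper's actual prompts.

```python
def build_inspection_prompt(product: str, defect_types: list[str]) -> str:
    """Build a text prompt for one product line (hypothetical template).

    The same model weights are reused for every product; only this
    text changes between inspection tasks.
    """
    if not defect_types:
        raise ValueError("at least one defect type is required")
    defects = ", ".join(defect_types)
    return (
        f"You are inspecting an image of a {product}. "
        f"Detect the following defect types: {defects}. "
        'Answer with a JSON list of objects of the form '
        '{"label": <defect type>, "box": [x1, y1, x2, y2]}.'
    )


# Illustrative product and defect names, not taken from the paper's label set.
nitrile_prompt = build_inspection_prompt("nitrile glove", ["hole", "stain"])
pvc_prompt = build_inspection_prompt("PVC glove", ["tear"])
```

Switching product lines then means changing the arguments rather than deploying a separate detector.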

## Abstract

Automated anomaly detection is vital to industrial quality control, yet conventional deep learning detectors often struggle with scalability. These models, typically following a rigid “one-model-per-task” paradigm, require separate systems for each product line, increasing operational complexity and cost in diverse manufacturing environments. To address this limitation, we propose a unified defect detection framework based on a Multimodal Large Language Model (MLLM). Our approach utilizes a two-stage fine-tuning strategy: Supervised Fine-Tuning (SFT) to impart domain-specific knowledge, followed by a novel Reinforcement Fine-Tuning (RFT) process that refines visual reasoning. This RFT stage is guided by a multi-faceted verifiable reward function designed to optimize localization accuracy, classification correctness, and output structure. On a challenging real-world glove manufacturing dataset, our RFT-enhanced MLLM achieves a mean Average Precision (mAP) of 0.63, which is comparable to a highly specialized YOLO baseline (0.62). More importantly, a single, unified MLLM trained on a mixed-product dataset maintains competitive performance (mAP 0.61), demonstrating its ability to dynamically handle different products and defect types via natural language prompts. This study validates the feasibility of using a single, flexible MLLM to replace multiple rigid models in complex industrial inspection, offering a scalable and cost-effective paradigm for future intelligent quality control systems. The open-source code will be released at https://github.com/GloamXun/Glove-MLLM.
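
The abstract describes an RFT reward that scores localization accuracy, classification correctness, and output structure. The sketch below shows one way such a verifiable reward could be composed. It is an assumption-laden illustration, not the authors' implementation: the JSON output schema, the plain-IoU localization term, the greedy matching, the weights, and the names `box_iou`, `parse_predictions`, and `detection_reward` are all hypothetical.

```python
import json


def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def parse_predictions(response):
    """Return the parsed detections, or None if the structure is invalid."""
    try:
        preds = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(preds, list):
        return None
    for p in preds:
        ok = (isinstance(p, dict)
              and isinstance(p.get("label"), str)
              and isinstance(p.get("box"), list)
              and len(p["box"]) == 4
              and all(isinstance(v, (int, float)) for v in p["box"]))
        if not ok:
            return None
    return preds


def detection_reward(response, targets, w_loc=0.5, w_cls=0.3, w_fmt=0.2):
    """Composite reward in [0, 1]; the weights are assumed, not from the paper."""
    preds = parse_predictions(response)
    if preds is None:
        return 0.0  # malformed output earns nothing
    if not targets:
        # Defect-free image: full credit only if nothing was predicted.
        return w_fmt + (w_loc + w_cls) * (1.0 if not preds else 0.0)
    unused = list(preds)
    loc_sum = cls_sum = 0.0
    for t in targets:
        if not unused:
            break
        # Greedy matching: take the unused prediction with the best overlap.
        best = max(unused, key=lambda p: box_iou(p["box"], t["box"]))
        unused.remove(best)
        score = box_iou(best["box"], t["box"])
        loc_sum += score
        cls_sum += float(score > 0 and best["label"] == t["label"])
    # Dividing by the larger count penalizes both misses and extra boxes.
    n = max(len(targets), len(preds))
    return w_fmt + w_loc * loc_sum / n + w_cls * cls_sum / n


targets = [{"label": "hole", "box": [0, 0, 10, 10]}]
perfect = detection_reward(json.dumps(targets), targets)
malformed = detection_reward("a hole near the thumb", targets)
```

Because every term is computed from the model output and the ground-truth annotations alone, the reward can be checked automatically, which is what makes it usable as a verifiable training signal.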

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12893585/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12893585/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/PMC12893585/full.md

---
Source: https://tomesphere.com/paper/PMC12893585