InspectVLM: Unified in Theory, Unreliable in Practice
Conor Wallace, Isaac Corley, Jonathan Lwowski

TL;DR
This paper evaluates the practical viability of unified vision-language models in industrial inspection, revealing significant limitations in robustness, accuracy, and reliability compared to traditional specialized models.
Contribution
It introduces InspectVLM, a Florence-2-based VLM trained on InspectMM, a new large-scale multimodal, multitask inspection dataset, and critically assesses its performance, highlighting key shortcomings in real-world industrial applications.
Findings
InspectVLM performs competitively on image-level classification and structured keypoint tasks.
It fails to match traditional ResNet-based models on core inspection metrics.
The model is brittle under low prompt variability and produces degenerate outputs.
Abstract
Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks such as classification, detection, and keypoint localization within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models in core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs…
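To make the idea of a "single language-driven interface" concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical set of task prompts and a Florence-2-style text output in which each detection is a label followed by four location tokens quantized into 1000 bins. The prompt strings, the function names, and the exact output format are illustrative assumptions.

```python
import re
from typing import List, Tuple

# Hypothetical task prompts: one model, one text interface, several tasks.
# The prompt wording used for InspectVLM is not given in this summary.
TASK_PROMPTS = {
    "classification": "<CLASSIFY>",
    "detection": "<DETECT>",
    "keypoints": "<KEYPOINTS>",
}

NUM_BINS = 1000  # assumed coordinate quantization (Florence-2-style location tokens)
LOC_TOKEN = re.compile(r"<loc_(\d+)>")
# One detection = a label followed by exactly four location tokens (assumed format).
DETECTION = re.compile(r"([^<>]+)((?:<loc_\d+>){4})")

Box = Tuple[float, float, float, float]


def build_prompt(task: str) -> str:
    """Return the text prompt that selects a task (hypothetical mapping)."""
    return TASK_PROMPTS[task]


def parse_detections(text: str, width: int, height: int) -> List[Tuple[str, Box]]:
    """Turn generated text into (label, pixel box) pairs.

    Text that does not follow the assumed format yields no detections,
    so a malformed generation surfaces as an empty list.
    """
    results = []
    for label, locs in DETECTION.findall(text):
        x1, y1, x2, y2 = (int(v) for v in LOC_TOKEN.findall(locs))
        box = (
            x1 * width / NUM_BINS,
            y1 * height / NUM_BINS,
            x2 * width / NUM_BINS,
            y2 * height / NUM_BINS,
        )
        results.append((label.strip(), box))
    return results


if __name__ == "__main__":
    generated = "scratch<loc_100><loc_200><loc_300><loc_400>"
    print(build_prompt("detection"), parse_detections(generated, 1000, 500))
```

Because every task's result has to be recovered from generated text, a parser like this is also where malformed generations become visible in a pipeline.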
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
