Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato

TL;DR
This paper introduces a novel few-shot visual inspection method using Vision-Language Models with in-context learning, enabling high-performance defect detection without extensive retraining for new products.
Contribution
It presents a fine-tuned VLM with in-context learning for visual inspection, reducing the need for large datasets and retraining for each new product.
Findings
Achieved an MCC of 0.804 on MVTec AD in the one-shot setting.
Achieved an F1-score of 0.950, demonstrating high defect detection accuracy.
Eliminated the need for extensive retraining for new inspection tasks.
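For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MCC (Matthews correlation coefficient) and F1-score are computed from binary confusion-matrix counts. It uses the standard textbook definitions, not the authors' evaluation code. The function names and the convention of treating "defective" as the positive class are assumptions made for illustration.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    returning 0.0 when any marginal sum is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative counts only; these are NOT results from the paper.
    # Assumption: "defective" is treated as the positive class.
    print(f1_score(tp=8, fp=2, fn=2))
    print(mcc(tp=8, tn=8, fp=2, fn=2))
```

MCC takes all four confusion-matrix cells into account, so it is generally more informative than F1 when the numbers of defective and non-defective samples are imbalanced, which is one plausible reason to report both.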
Abstract
We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with output text in a unified format, and fine-tune the VLM. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of a non-defective or defective image and the corresponding explanatory texts with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. The experimental results show that our…
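The one-shot in-context setup described in the abstract can be pictured as an interleaved sequence of one labeled example image, its explanatory text, and the query image. The sketch below is only an illustration of that idea under stated assumptions: the `InspectionExample` class, the `build_one_shot_prompt` function, the message dictionary layout, and the wording of the instructions are all hypothetical, and it does not model the visual prompts or the unified output format mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InspectionExample:
    """Hypothetical container for one in-context example."""
    image_path: str
    is_defective: bool
    explanation: str  # explanatory text serving as the inspection criterion


def build_one_shot_prompt(
    example: InspectionExample, query_image_path: str
) -> List[Dict[str, str]]:
    """Interleave the example image and its text with the query image.

    The message layout is an assumption; a real VLM would expect whatever
    multimodal input format its own processor defines.
    """
    label = "defective" if example.is_defective else "non-defective"
    return [
        {"type": "image", "path": example.image_path},
        {"type": "text",
         "text": f"Example: this product is {label}. {example.explanation}"},
        {"type": "image", "path": query_image_path},
        {"type": "text",
         "text": "Using the example above as the inspection criterion, "
                 "is this product defective or non-defective?"},
    ]


if __name__ == "__main__":
    # Placeholder file names, used only for illustration.
    shot = InspectionExample("good_sample.png", False,
                             "The surface has no scratches or dents.")
    for message in build_one_shot_prompt(shot, "query_sample.png"):
        print(message)
```

Because the example and its criterion travel with each request, switching to a new product in this sketch means swapping the `InspectionExample` rather than retraining anything, which mirrors the motivation stated in the abstract.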
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Infrastructure Maintenance and Monitoring · Multimodal Machine Learning Applications
