Sanitizing Manufacturing Dataset Labels Using Vision-Language Models

Nazanin Mahjourian; Vinh Nguyen

arXiv:2506.23465·cs.CV·July 1, 2025

Open Access

TL;DR

This paper presents VLSR, a vision-language framework that uses CLIP embeddings to identify and correct noisy labels in manufacturing datasets, improving data quality for machine learning.

Contribution

It introduces a novel vision-language-based approach for label sanitization and refinement in manufacturing datasets, reducing label noise and improving consistency.

Findings

01

Effective identification of irrelevant and misspelled labels

02

Significant reduction in label vocabulary size

03

Enhanced dataset quality for industrial ML applications

Abstract

The success of machine learning models in industrial applications depends heavily on the quality of the datasets used to train them. However, large-scale datasets, especially those constructed from crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. This problem is particularly pronounced in manufacturing domains, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitization and Refinement (VLSR), a vision-language-based framework for label sanitization and refinement in multi-label manufacturing image datasets. The method uses the CLIP vision-language model to embed both images and their associated textual labels into a shared semantic space. Two key tasks are then addressed by computing the cosine similarity between embeddings. First, label…
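The similarity step the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image and label embeddings have already been produced by a model such as CLIP, and the toy vectors, the `flag_low_similarity_labels` helper, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def flag_low_similarity_labels(image_embedding, label_embeddings, threshold):
    """Return the labels whose similarity to the image is below `threshold`.

    `label_embeddings` maps each label string to its embedding vector.
    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return [
        label
        for label, embedding in label_embeddings.items()
        if cosine_similarity(image_embedding, embedding) < threshold
    ]


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
image_embedding = [1.0, 0.0, 0.0]
label_embeddings = {
    "gear": [0.9, 0.1, 0.0],    # points almost the same way as the image
    "sunset": [0.0, 1.0, 0.0],  # orthogonal to the image
}

flagged = flag_low_similarity_labels(image_embedding, label_embeddings, threshold=0.5)
print(flagged)
```

In this toy case only the label whose vector is orthogonal to the image embedding is flagged. With real embeddings the threshold would need to be chosen empirically for the dataset at hand.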

Taxonomy

Topics: Machine Learning and Data Classification · Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning