Sample selection from a given dataset to validate machine learning models

Bertrand Iooss (EDF R&D PRISME; GdR MASCOT-NUM)

arXiv:2104.14401 · stat.ML · April 30, 2021 · 1 citation

Open Access

TL;DR

This paper proposes a statistical approach using support points and Maximum Mean Discrepancy to select validation datasets for supervised machine learning, demonstrated through an industrial case study.

Contribution

It introduces a novel method for dataset selection based on design of experiments and support points, enhancing validation in industrial machine learning applications.

Findings

01. Support points effectively select validation datasets.

02. Method improves validation accuracy and reliability.

03. Industrial case study confirms practical benefits.

Abstract

Industrial uses of supervised machine learning algorithms often require selecting a validation basis from a full dataset. This validation basis is then used to evaluate the machine learning model independently. To select this basis, we propose to adopt a "design of experiments" point of view, using statistical criteria. We show that the "support points" concept, based on the Maximum Mean Discrepancy criterion, is particularly relevant. An industrial test case from the company EDF illustrates the practical interest of the methodology.
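To illustrate the design-of-experiments view described in the abstract, here is a minimal Python sketch. It is not the paper's implementation. It assumes the energy distance (a Maximum Mean Discrepancy with a distance-based kernel, the criterion that support points minimize) as the discrepancy between the validation basis and the full dataset, and it swaps the continuous support-point optimization for a simple greedy selection restricted to rows of the dataset. The function names `energy_criterion` and `greedy_validation_subset`, and the synthetic Gaussian data, are hypothetical.

```python
import numpy as np


def energy_criterion(D, idx):
    """Energy-distance criterion between the subset `idx` and the full data.

    D is the full N x N Euclidean distance matrix. The constant term
    (mean of D) is omitted because it does not depend on the subset.
    """
    idx = np.asarray(idx)
    cross = D[idx].mean()                 # mean distance subset <-> full data
    within = D[np.ix_(idx, idx)].mean()   # mean distance inside the subset
    return 2.0 * cross - within


def greedy_validation_subset(X, n_select):
    """Greedily pick n_select rows of X that minimize the energy criterion.

    Assumption: a greedy heuristic, not the support-point algorithm itself.
    Returns the selected row indices and the distance matrix.
    """
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))  # O(N^2) memory: small N only
    mean_to_all = D.mean(axis=1)

    selected = []
    available = np.ones(N, dtype=bool)
    sum_cross = 0.0              # sum of mean_to_all over selected points
    sum_within = 0.0             # sum of D[i, k] over selected pairs (i, k)
    dist_to_sel = np.zeros(N)    # sum of distances from each row to the selection

    for m in range(n_select):
        k = m + 1
        # Criterion value if each candidate were added to the current selection.
        crit = 2.0 * (sum_cross + mean_to_all) / k \
            - (sum_within + 2.0 * dist_to_sel) / k ** 2
        crit[~available] = np.inf
        c = int(np.argmin(crit))
        selected.append(c)
        available[c] = False
        sum_cross += mean_to_all[c]
        sum_within += 2.0 * dist_to_sel[c]
        dist_to_sel += D[:, c]

    return np.array(selected), D


# Usage on synthetic data: hold out 20 of 200 points as a validation basis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
val_idx, D = greedy_validation_subset(X, 20)
train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
print(len(val_idx), len(train_idx))
print(energy_criterion(D, val_idx))
```

The remaining rows (`train_idx`) would be used for training. The full distance matrix makes this sketch suitable only for datasets of a few thousand points.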

Taxonomy

Topics: Advanced Statistical Process Monitoring · Fault Detection and Control Systems · Industrial Vision Systems and Defect Detection