Sample selection from a given dataset to validate machine learning models
Bertrand Iooss (EDF R&D PRISME, GdR MASCOT-NUM)

TL;DR
This paper proposes a statistical approach using support points and Maximum Mean Discrepancy to select validation datasets for supervised machine learning, demonstrated through an industrial case study.
Contribution
It introduces a novel method for dataset selection based on design of experiments and support points, enhancing validation in industrial machine learning applications.
Findings
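Support points, based on Maximum Mean Discrepancy criteria, are particularly relevant for selecting a validation basis.
The selected validation basis provides an independent evaluation of the machine learning model.
An industrial test case from EDF illustrates the practical interest of the methodology.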
Abstract
Selecting a validation basis from a full dataset is often required in industrial uses of supervised machine learning algorithms. This validation basis provides an independent evaluation of the machine learning model. To select it, we propose to adopt a "design of experiments" point of view and rely on statistical criteria. We show that the "support points" concept, based on Maximum Mean Discrepancy criteria, is particularly relevant. An industrial test case from the company EDF illustrates the practical interest of the methodology.
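To make the idea of choosing a validation basis with a discrepancy criterion concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the energy distance as the criterion (a Maximum Mean Discrepancy with the distance-induced kernel, which is the criterion used in the support-points literature), it assumes a greedy one-point-at-a-time search restricted to points already in the dataset, and it works on the inputs only. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance(X, Y):
    """Energy distance between the empirical distributions of X and Y.

    Equals 2 E|X-Y| - E|X-X'| - E|Y-Y'| with empirical means, and is zero
    when both samples coincide.
    """
    return (2.0 * pairwise_distances(X, Y).mean()
            - pairwise_distances(X, X).mean()
            - pairwise_distances(Y, Y).mean())


def greedy_validation_indices(X, n_val):
    """Greedily pick n_val rows of X that keep the energy distance to X small.

    Assumption: a simple greedy heuristic, not necessarily the algorithm
    used in the paper. It stores an N x N distance matrix, so it only suits
    moderate dataset sizes.
    """
    D = pairwise_distances(X, X)
    mean_to_all = D.mean(axis=1)          # average distance to the full dataset
    sum_to_selected = np.zeros(len(X))    # summed distance to points picked so far
    available = np.ones(len(X), dtype=bool)
    selected = []
    for k in range(n_val):
        # Adding point i to the current subset changes the energy distance,
        # up to terms that do not depend on i, by this score (lower is better).
        score = mean_to_all - sum_to_selected / (k + 1)
        score[~available] = np.inf        # never pick the same point twice
        i = int(np.argmin(score))
        selected.append(i)
        available[i] = False
        sum_to_selected += D[:, i]
    return np.array(selected)


# Hypothetical usage on synthetic inputs: 200 points in 2D, 20 kept for validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
val_idx = greedy_validation_indices(X, 20)
train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
print("energy distance, selected subset:", energy_distance(X, X[val_idx]))
```

The remaining indices in `train_idx` would then form the learning basis. Comparing the printed value with the same quantity computed on a randomly drawn subset gives a quick sense of how representative the selected validation points are.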
Taxonomy
Topics: Advanced Statistical Process Monitoring · Fault Detection and Control Systems · Industrial Vision Systems and Defect Detection
