Ensuring Dataset Quality for Machine Learning Certification
Sylvaine Picard, Camille Chapdelaine, Cyril Cappi, Laurent Gardes, Eric Jenn, Baptiste Lefèvre, Thomas Soumarmon

TL;DR
This paper proposes a dataset specification and verification process tailored for ML in safety-critical systems, addressing gaps in existing standards and providing practical recommendations for dataset management.
Contribution
It introduces a novel dataset specification and verification process specifically designed for ML safety-critical applications, filling a gap in current standards.
Findings
Applies the proposed process to a signal recognition system from the railway domain
Provides a list of recommendations for dataset collection and management
Positions the work as one step towards the dataset engineering process that safety-critical ML will require
Abstract
In this paper, we address the problem of dataset quality in the context of Machine Learning (ML)-based critical systems. We briefly analyse the applicability of some existing standards dealing with data and show that the specificities of the ML context are neither properly captured nor taken into account. As a first answer to this concerning situation, we propose a dataset specification and verification process, and apply it to a signal recognition system from the railway domain. In addition, we give a list of recommendations for the collection and management of datasets. This work is one step towards the dataset engineering process that will be required for ML to be used in safety-critical systems.
