A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data
Olivier Niel

TL;DR
This paper introduces a genetic algorithm-based method for generating artificial datasets to train machine learning models. The method is especially effective when real data is scarce or costly to obtain.
Contribution
The paper presents a novel genetic algorithm approach to generate large artificial datasets that improve model training in data-scarce scenarios.
Findings
When real data is abundant, models trained on generated data reach accuracy comparable to models trained on real data.
Under extreme scarcity of real data, models trained on generated data outperform models trained on the scarce real data.
The method reduces the need for costly real-world data collection.
Abstract
Training machine learning models requires large datasets. However, collecting, curating, and operating large and complex sets of real world data poses problems of costs, ethical and legal issues, and data availability. Here we propose a novel algorithm to generate large artificial datasets to train machine learning models in conditions of extreme scarcity of real world data. The algorithm is based on a genetic algorithm, which mutates randomly generated datasets subsequently used for training a neural network. After training, the performance of the neural network on a batch of real world data is considered a surrogate for the fitness of the generated dataset used for its training. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases through generations. The performance of the data…
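The loop described in the abstract (random datasets, training, fitness measured on real data, selection, mutation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a small softmax classifier written in NumPy stands in for the neural network, the "real" batch is a hypothetical two-class toy sample, and every function name, population size, and mutation rate here is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_real_batch(n_per_class=5):
    # Hypothetical stand-in for a scarce real-world batch: two Gaussian blobs in 2-D.
    x0 = rng.normal(loc=-2.0, scale=0.5, size=(n_per_class, 2))
    x1 = rng.normal(loc=2.0, scale=0.5, size=(n_per_class, 2))
    return np.vstack([x0, x1]), np.array([0] * n_per_class + [1] * n_per_class)

def random_dataset(n_samples=40, n_features=2, n_classes=2):
    # One individual of the population: a randomly generated labelled dataset.
    X = rng.uniform(-4.0, 4.0, size=(n_samples, n_features))
    y = rng.integers(0, n_classes, size=n_samples)
    return X, y

def train_softmax(X, y, n_classes=2, epochs=200, lr=0.1):
    # Assumption: a softmax classifier replaces the paper's neural network.
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(X)
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def fitness(dataset, real_X, real_y):
    # Fitness surrogate: accuracy on real data of a model trained on the generated data.
    W, b = train_softmax(*dataset)
    pred = np.argmax(real_X @ W + b, axis=1)
    return float(np.mean(pred == real_y))

def mutate(dataset, feature_sigma=0.3, label_resample_prob=0.05, n_classes=2):
    # Perturb features with Gaussian noise and resample a small fraction of labels.
    X, y = dataset
    X_new = X + rng.normal(0.0, feature_sigma, size=X.shape)
    resample = rng.random(len(y)) < label_resample_prob
    y_new = np.where(resample, rng.integers(0, n_classes, size=len(y)), y)
    return X_new, y_new

def evolve(real_X, real_y, pop_size=20, n_keep=5, generations=15):
    population = [random_dataset() for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        scores = [fitness(d, real_X, real_y) for d in population]
        order = np.argsort(scores)[::-1]
        history.append(scores[order[0]])
        # Selection pressure: keep the fittest unchanged, discard the rest.
        survivors = [population[i] for i in order[:n_keep]]
        children = [mutate(survivors[rng.integers(n_keep)])
                    for _ in range(pop_size - n_keep)]
        population = survivors + children
    scores = [fitness(d, real_X, real_y) for d in population]
    best = int(np.argmax(scores))
    return population[best], scores[best], history

if __name__ == "__main__":
    real_X, real_y = make_real_batch()
    best_dataset, best_score, history = evolve(real_X, real_y)
    print("best fitness per generation:", history)
    print("final best fitness:", best_score)
```

Because the fittest datasets are carried over unchanged and training here is deterministic, the best fitness in this sketch cannot decrease from one generation to the next. Note that measuring fitness on the same small real batch risks tuning the generated data to that batch, so a separate held-out real sample would be needed to judge the final model.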
Taxonomy
Topics: AI in cancer detection · Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging
