High Throughput Training of Deep Surrogates from Large Ensemble Runs

Lucas Meyer (DATAMOVE; SINCLAIR AI Lab; EDF R&D); Marc Schouler (DATAMOVE); Robert Alexander Caulk (DATAMOVE); Alejandro Ribés (EDF R&D); Bruno Raffin (DATAMOVE)

arXiv:2309.16743·cs.LG·October 2, 2023

TL;DR

This paper introduces an open-source framework that trains deep surrogate models online from a large ensemble of simulations, streaming the generated data directly to training to avoid I/O and storage bottlenecks.

Contribution

The paper presents a streaming training framework that uses multiple levels of parallelism to generate data and a reservoir buffer to mitigate streaming bias while keeping GPU throughput high.

Findings

01  Enabled training on 8TB of data in 2 hours

02  Achieved 47% accuracy improvement

03  Increased batch throughput by 13 times

Abstract

Recent years have seen a surge in deep learning approaches to accelerate numerical solvers, which provide faithful but computationally intensive simulations of the physical world. These deep surrogates are generally trained in a supervised manner from limited amounts of data slowly generated by the same solver they intend to accelerate. We propose an open-source framework that enables the online training of these models from a large ensemble run of simulations. It leverages multiple levels of parallelism to generate rich datasets. The framework avoids I/O bottlenecks and storage issues by directly streaming the generated data. A training reservoir mitigates the inherent bias of streaming while maximizing GPU throughput. An experiment on training a fully connected network as a surrogate for the heat equation shows that the proposed approach enables training on 8TB of data in 2 hours with an…
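The abstract's training reservoir can be pictured with a small sketch. This is not the framework's implementation: the class `ReservoirBuffer`, its random-eviction policy, and the `stream` generator standing in for solver output are all assumptions made for illustration. The idea shown is only that a fixed-capacity buffer lets each batch mix recently streamed samples with older ones, so consecutive batches are less dominated by whichever simulation happens to be sending data at that moment.

```python
import random
from typing import Any, Iterator, List, Tuple


class ReservoirBuffer:
    """Hypothetical fixed-capacity buffer mixing fresh and older streamed samples."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: List[Any] = []
        self._rng = random.Random(seed)

    def put(self, sample: Any) -> None:
        """Store a streamed sample, overwriting a random slot once the buffer is full."""
        if len(self._items) < self.capacity:
            self._items.append(sample)
        else:
            # Assumed eviction policy: replace a uniformly chosen stored sample.
            victim = self._rng.randrange(self.capacity)
            self._items[victim] = sample

    def sample_batch(self, batch_size: int) -> List[Any]:
        """Draw a batch uniformly at random; samples stay in the buffer for reuse."""
        if not self._items:
            raise ValueError("buffer is empty")
        k = min(batch_size, len(self._items))
        return self._rng.sample(self._items, k)

    def __len__(self) -> int:
        return len(self._items)


def stream(n_simulations: int, n_steps: int) -> Iterator[Tuple[int, int]]:
    """Stand-in for solver output: yields (simulation_id, time_step) in arrival order."""
    for sim in range(n_simulations):
        for step in range(n_steps):
            yield (sim, step)


if __name__ == "__main__":
    buffer = ReservoirBuffer(capacity=64)
    for i, sample in enumerate(stream(n_simulations=10, n_steps=100)):
        buffer.put(sample)
        if i % 50 == 49:
            batch = buffer.sample_batch(8)
            # Show which simulations contribute to this batch.
            print(sorted({sim for sim, _ in batch}))
```

Because sampling does not remove items, the training loop can keep drawing batches even when data arrives more slowly than the GPU consumes it, which is one plausible reading of how a reservoir helps throughput. The capacity and eviction policy control the trade-off between sample reuse and freshness.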
