AI Surrogate Model for Distributed Computing Workloads
David K. Park, Yihui Ren, Ozgur O. Kilic, Tatiana Korchuganova, Sairam Sri Vatsavai, Joseph Boudreau, Tasnuva Chowdhury, Shengyu Feng, Raees Khan, Jaehyung Kim, Scott Klasky, Tadashi Maeno, Paul Nilsson, Verena Ingrid Martinez Outschoorn, Norbert Podhorszki, Frederic Suter

TL;DR
This paper introduces a generative surrogate modeling approach using diffusion models to simulate large-scale scientific workloads, enhancing data privacy and aiding optimization in distributed computing environments.
Contribution
It evaluates four generative models on real-world workload data, identifying TabDDPM as the most suitable for privacy-preserving data simulation.
Findings
SMOTE and TabDDPM produce data nearly indistinguishable from real data
SMOTE has the lowest privacy preservation among tested models
TabDDPM is identified as the best model for privacy-preserving workload data generation
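One way to see why an interpolation-based generator such as SMOTE can score poorly on privacy is to measure how close its synthetic rows sit to real rows. The sketch below is an illustration only and is not taken from the paper: it uses toy Gaussian data, a simplified SMOTE-style interpolation, and distance to closest record (DCR) as an assumed privacy proxy. The function names smote_like_samples and distance_to_closest_record are hypothetical.

```python
import numpy as np


def smote_like_samples(real, n_samples, k=5, seed=0):
    """Simplified SMOTE-style generator (illustrative, not the paper's code).

    Each synthetic row lies on the segment between a real row and one of
    its k nearest real neighbours.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between real rows.
    diffs = real[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base row, one of its neighbours, and an interpolation weight.
    base = rng.integers(0, len(real), size=n_samples)
    picked = neighbours[base, rng.integers(0, k, size=n_samples)]
    lam = rng.random((n_samples, 1))
    return real[base] + lam * (real[picked] - real[base])


def distance_to_closest_record(synthetic, real):
    """Euclidean distance from each synthetic row to its nearest real row."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


# Toy stand-in for workload records: 200 rows, 4 numeric features.
rng = np.random.default_rng(42)
real = rng.normal(size=(200, 4))

# Synthetic rows from interpolation vs. fresh draws from the same distribution.
synthetic_smote = smote_like_samples(real, n_samples=200)
synthetic_fresh = rng.normal(size=(200, 4))

dcr_smote = distance_to_closest_record(synthetic_smote, real)
dcr_fresh = distance_to_closest_record(synthetic_fresh, real)

print("median DCR, SMOTE-like:", np.median(dcr_smote))
print("median DCR, fresh draws:", np.median(dcr_fresh))
```

Because every interpolated row stays on a segment between two real records, its DCR is bounded by half the length of that segment, so such rows tend to sit much closer to real data than independent draws do. A smaller DCR suggests a higher risk that a synthetic row reveals a real one, which is consistent in spirit with the finding above, although the metrics used in the paper may differ.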
Abstract
Large-scale international scientific collaborations, such as ATLAS, Belle II, CMS, and DUNE, generate vast volumes of data. These experiments necessitate substantial computational power for varied tasks, including structured data processing, Monte Carlo simulations, and end-user analysis. Centralized workflow and data management systems are employed to handle these demands, but current decision-making processes for data placement and payload allocation are often heuristic and disjointed. This optimization challenge potentially could be addressed using contemporary machine learning methods, such as reinforcement learning, which, in turn, require access to extensive data and an interactive environment. Instead, we propose a generative surrogate modeling approach to address the lack of training data and concerns about privacy preservation. We have collected and processed real-world job…
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Processing Techniques · Distributed and Parallel Computing Systems
