PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Thomas Rausch; Waldemar Hummer; Vinod Muthusamy

arXiv:2006.12587·cs.DC·June 24, 2020

Open Access

TL;DR

This paper introduces PipeSim, a trace-driven simulation environment in which researchers and engineers can devise and evaluate operational strategies for large-scale AI workflow platforms, weighing cost against model-specific goals such as accuracy, robustness, or fairness.

Contribution

The paper presents a comprehensive, stochastic simulation model and accompanying toolkit, built from analytics data of a production-grade IBM AI platform, for evaluating AI workflow management strategies.

Findings

01 Effective simulation of AI pipeline interactions with infrastructure

02 Enables testing of scheduling and resource allocation strategies

03 Supports analysis of AI model metrics like accuracy and fairness
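To make the idea of testing scheduling strategies against a trace concrete, here is a minimal sketch in Python. It is not PipeSim's API or model: the `Job` record, the four-entry `TRACE`, the `fifo` and `shortest_first` policies, and the `simulate` function are all hypothetical names invented for illustration, and the trace values are made up rather than taken from IBM data. The sketch replays recorded arrival and execution times on a fixed number of workers and reports the mean queueing delay for a given policy.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    arrival: float   # submission time taken from the trace
    duration: float  # execution time taken from the trace


# Hypothetical trace; a real study would load records from platform logs.
TRACE = [
    Job("train-a", 0.0, 10.0),
    Job("train-b", 1.0, 6.0),
    Job("eval-c", 2.0, 1.0),
    Job("train-d", 3.0, 2.0),
]


def fifo(job):
    """Policy: run the job that was submitted first."""
    return job.arrival


def shortest_first(job):
    """Policy: run the job with the shortest recorded duration."""
    return job.duration


def simulate(trace, priority, workers=1):
    """Replay `trace` on identical workers (non-preemptive); return mean wait."""
    pending = sorted(trace, key=lambda j: j.arrival)
    free_at = [0.0] * workers  # min-heap of times at which workers become free
    heapq.heapify(free_at)
    queue, waits, i = [], [], 0
    while i < len(pending) or queue:
        now = heapq.heappop(free_at)
        # With nothing queued, the worker idles until the next arrival.
        if not queue and pending[i].arrival > now:
            now = pending[i].arrival
        # Admit every job that has arrived by `now`.
        while i < len(pending) and pending[i].arrival <= now:
            queue.append(pending[i])
            i += 1
        job = min(queue, key=priority)
        queue.remove(job)
        start = max(now, job.arrival)  # a job never starts before it arrives
        waits.append(start - job.arrival)
        heapq.heappush(free_at, start + job.duration)
    return sum(waits) / len(waits)


if __name__ == "__main__":
    for policy in (fifo, shortest_first):
        print(policy.__name__, simulate(TRACE, policy))
```

On this made-up trace with one worker, FIFO yields a mean wait of 9.25 time units and shortest-first yields 7.0; passing `workers=2` models a larger resource allocation. A full simulator would also sample durations stochastically and track model metrics, which this sketch leaves out.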

Abstract

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive…

Taxonomy

Topics: Scientific Computing and Data Management · Data Stream Mining Techniques · Data Visualization and Analytics