The infrastructure powering IBM's Gen AI model development
Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla, Lan Hoang, Danny Barnett, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim

TL;DR
This paper describes IBM's hybrid cloud infrastructure, Vela and Blue Vela, designed to support large-scale generative AI model training with high performance, flexibility, and future-proofing, integrating hardware, software, and telemetry.
Contribution
It introduces IBM's integrated AI infrastructure solutions, Vela and Blue Vela, optimized for scalable, efficient, and adaptable large-scale AI model training.
Findings
Vela offers scalable, distributed AI training capabilities.
Blue Vela enables rapid development of large AI models.
The infrastructure supports both internal innovation and commercial deployment.
Abstract
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for…
Taxonomy
Topics: Robotics and Automated Systems · Evolutionary Algorithms and Applications · Digital Transformation in Industry
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
