A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference
Michael V. DeBole, Rathinakumar Appuswamy, Neil McGlohon, Brian Taba, Steven K. Esser, Filipp Akopyan, John V. Arthur, Arnon Amir, Alexander Andreopoulos, Peter J. Carlson, Andrew S. Cassidy, Pallab Datta, Myron D. Flickner, Rajamohan Gandhasri, Guillaume J. Garreau, Megumi Ito

TL;DR
This paper presents a scalable, energy-efficient, end-to-end system integrating neural inference accelerators, training, and runtime to enable low-latency large language model inference in data centers.
Contribution
It introduces a novel vertically integrated system architecture combining hardware, software, and training for scalable LLM inference.
Findings
Achieves 115 peta-ops at 4-bit precision with low power consumption.
Supports multiple large models simultaneously with low latency.
Demonstrates scalability and reconfigurability for various model sizes.
Abstract
A vertically integrated, end-to-end, research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m^2 42U rack footprint. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model at 2,048 context length with 28 simultaneous users and a per-user inter-token latency of 2.8 ms. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is ideal for deploying agentic workflows for enterprise AI applications in existing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques
