IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System

Minseok Seo; Xuan Truong Nguyen; Seok Joong Hwang; Yongkee Kwon; Guhyun Kim; Chanwook Park; Ilkon Kim; Jaehan Park; Jeongbin Kim; Woojae Shin; Jongsoon Won; Haerang Choi; Kyuyoung Kim; Daehan Kwon; Chunseok Jeong; Sangheon Lee; Yongseok Choi; Wooseok Byun; Seungcheol Baek; Hyuk-Jae Lee; John Kim

arXiv:2410.15008·cs.AR·October 22, 2024

TL;DR

IANUS is an integrated accelerator that combines an NPU and PIM over a unified memory system to accelerate end-to-end LLM inference, speeding up GPT-2 inference by 6.2x over an NVIDIA A100 and by 3.2x on average over the state-of-the-art accelerator.

Contribution

The paper introduces IANUS, a domain-specific architecture that unifies NPU and PIM with a shared memory system and novel scheduling to enhance LLM inference performance.

Findings

01

IANUS improves GPT-2 inference speed by 6.2x over NVIDIA A100.

02

It achieves a 3.2x average speedup compared to the state-of-the-art accelerator.

03

A prototype built from a commercial PIM, an NPU, and an FPGA demonstrates the feasibility of the design.

Abstract

Accelerating end-to-end inference of transformer-based large language models (LLMs) is a critical component of AI services in datacenters. However, diverse compute characteristics of end-to-end LLM inference present challenges as previously proposed accelerators only address certain operations or stages (e.g., self-attention, generation stage, etc.). To address the unique challenges of accelerating end-to-end inference, we propose IANUS -- Integrated Accelerator based on NPU-PIM Unified Memory System. IANUS is a domain-specific system architecture that combines a Neural Processing Unit (NPU) with a Processing-in-Memory (PIM) to leverage both the NPU's high computation throughput and the PIM's high effective memory bandwidth. In particular, IANUS employs a unified main memory system where the PIM memory is used both for PIM operations and for NPU's main memory. The unified main memory…
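The split the abstract describes, with the NPU supplying compute throughput and the PIM supplying memory bandwidth, can be illustrated with a rough arithmetic-intensity estimate for a fully connected layer. The sketch below is not the scheduling used by IANUS. The `FcLayer` type, the `choose_unit` helper, the layer dimensions, and the `npu_balance` threshold are all hypothetical values chosen for illustration, and the model assumes 16-bit weights and activations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FcLayer:
    """A fully connected layer applied to `tokens` input vectors (hypothetical model)."""
    tokens: int              # input vectors processed together
    d_in: int                # input feature dimension
    d_out: int               # output feature dimension
    bytes_per_elem: int = 2  # assumption: 16-bit weights and activations


def arithmetic_intensity(layer: FcLayer) -> float:
    """FLOPs per byte moved, counting weights, inputs, and outputs once each."""
    flops = 2 * layer.tokens * layer.d_in * layer.d_out
    elems_moved = layer.d_in * layer.d_out + layer.tokens * (layer.d_in + layer.d_out)
    return flops / (layer.bytes_per_elem * elems_moved)


def choose_unit(layer: FcLayer, npu_balance: float = 100.0) -> str:
    """Pick a unit by comparing intensity to an assumed NPU FLOPs-per-byte balance point.

    `npu_balance` is an illustrative threshold, not a figure from the paper.
    """
    return "PIM" if arithmetic_intensity(layer) < npu_balance else "NPU"


# Many tokens at once (e.g. processing a whole prompt): weights are reused heavily.
prompt = FcLayer(tokens=512, d_in=1024, d_out=4096)
# One token at a time (generation stage): each weight is read for a single multiply-add.
generation = FcLayer(tokens=1, d_in=1024, d_out=4096)

for name, layer in (("prompt", prompt), ("generation", generation)):
    print(f"{name}: {arithmetic_intensity(layer):.1f} FLOPs/byte -> {choose_unit(layer)}")
```

Under these assumptions the single-token layer performs roughly one FLOP per byte, so it is limited by memory bandwidth, while the 512-token layer reuses each weight enough to be limited by compute. That contrast is one way to read the "diverse compute characteristics" the abstract mentions.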
