Ouroboros: Wafer-Scale SRAM CIM with Token-Grained Pipelining for Large Language Model Inference
Yiqi Liu, Yudong Pan, Mengdi Wang, Shixin Zhao, Haonan Zhu, Yinhe Han, Lei Zhang, and Ying Wang

TL;DR
Ouroboros is a wafer-scale SRAM CIM architecture for large language model inference. It cuts energy and latency by executing all operations in situ, and it uses token-grained pipelining, distributed dynamic KV cache management, and communication-aware core mapping to make the most of its limited SRAM capacity.
Contribution
It introduces token-grained pipelining, distributed dynamic KV cache management, and communication-aware core mapping for wafer-scale SRAM CIM in LLM inference.
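As a rough illustration of the distributed dynamic KV cache idea, the sketch below places fixed-size KV blocks on whichever core currently has the most spare SRAM and records the placement in a per-sequence block table. This is a toy model under stated assumptions, not the paper's mechanism: the class `DistributedKVCache`, the fixed block granularity, and the "most free space first" policy are all hypothetical choices made for the example.

```python
class DistributedKVCache:
    """Toy allocator: KV blocks live on any core with spare SRAM (hypothetical)."""

    def __init__(self, free_blocks_per_core):
        # free_blocks_per_core[c] = number of spare KV blocks on core c.
        self.free = list(free_blocks_per_core)
        # seq_id -> list of core ids, one entry per allocated KV block.
        self.block_table = {}

    def append_block(self, seq_id):
        """Allocate one more KV block for seq_id on the core with most free space."""
        core = max(range(len(self.free)), key=lambda c: self.free[c])
        if self.free[core] == 0:
            raise MemoryError("no free KV blocks on any core")
        self.free[core] -= 1
        self.block_table.setdefault(seq_id, []).append(core)
        return core

    def release(self, seq_id):
        """Return all blocks of a finished sequence to their cores."""
        for core in self.block_table.pop(seq_id, []):
            self.free[core] += 1


cache = DistributedKVCache([1, 0, 2])  # three cores with uneven leftover SRAM
placement = [cache.append_block("seq-a") for _ in range(3)]
print(placement, cache.free)
cache.release("seq-a")
print(cache.free)
```

Because a sequence's blocks need not be contiguous or co-located with the core that computes on them, leftover fragments on different cores can all be put to use, which is the property the contribution statement points at.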
Findings
Achieves an average 4.1x throughput improvement
Achieves an average 4.2x energy efficiency gain
Peaks at 9.1x throughput and 17x energy efficiency for the 13B model
Abstract
Conventional LLM inference architectures suffer from high energy consumption and latency because data moves frequently across the memory hierarchy. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip data migration. To make the most of its limited first-level capacity, we introduce three innovations: (1) Token-Grained Pipelining, which replaces sequence-level pipelining to mitigate sequence-length variation, boosting utilization and reducing activation storage; (2) Distributed Dynamic KV Cache Management, which decouples memory from compute to leverage fragmented SRAM for efficient KV storage; and (3) Communication-Aware Mapping, which optimizes core allocation for locality and fault tolerance across the wafer. Experimental results show Ouroboros achieves average gains of 4.1x in throughput and 4.2x in energy efficiency, peaking at …
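To see why pipelining at token granularity can help when sequence lengths vary, the following toy model compares the two schemes on a lockstep pipeline. It is a simplified sketch, not the paper's simulator: it assumes every stage costs one time unit per token, that stages advance in lockstep, and it ignores attention dependencies and communication. The function names are hypothetical.

```python
def sequence_grained_time(lengths, num_stages):
    """Total time when each pipeline slot carries a whole sequence.

    Stages advance in lockstep, so each step lasts as long as the longest
    sequence currently occupying any stage.
    """
    n = len(lengths)
    total = 0
    for step in range(n + num_stages - 1):
        # Indices of the sequences sitting in stages 0..num_stages-1 this step.
        active = [step - s for s in range(num_stages) if 0 <= step - s < n]
        total += max(lengths[i] for i in active)
    return total


def token_grained_time(lengths, num_stages):
    """Total time when each pipeline slot carries a single token.

    Every slot has unit cost, so the pipeline fills once and then drains.
    """
    return sum(lengths) + num_stages - 1


def utilization(lengths, num_stages, total_time):
    """Fraction of stage-time spent on useful token work."""
    useful = num_stages * sum(lengths)
    return useful / (num_stages * total_time)


lengths = [4, 1, 1, 4]  # tokens per sequence, deliberately uneven
stages = 2
t_seq = sequence_grained_time(lengths, stages)
t_tok = token_grained_time(lengths, stages)
print(t_seq, round(utilization(lengths, stages, t_seq), 3))
print(t_tok, round(utilization(lengths, stages, t_tok), 3))
```

In this model, short sequences leave a stage idle while a long neighbor finishes, whereas unit-sized token slots keep every stage busy once the pipeline is full. A stage also holds one token's activations at a time instead of a whole sequence's, which is consistent with the reduced activation storage the abstract mentions.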
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
