Hemlet: A Heterogeneous Compute-in-Memory Chiplet Architecture for Vision Transformers with Group-Level Parallelism

Cong Wang; Zexin Fu; Jiayi Huang; Shanshi Huang

arXiv:2511.15397·cs.AR·February 10, 2026

Open Access

TL;DR

Hemlet is a scalable, heterogeneous compute-in-memory (CIM) chiplet architecture that accelerates Vision Transformers through group-level parallelism and system-level dataflow optimizations, achieving 2.41x to 5.74x speedup across configurations, 9.56 TOPS throughput, and 4.98 TOPS/W energy efficiency.

Contribution

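This work introduces Hemlet, a chiplet-based heterogeneous CIM system that uses group-level parallelism to accelerate ViTs. It targets two problems: the single-chip size limit that restricts the scalability of monolithic CIM designs, and the expensive inter-chiplet communication that can hinder throughput in chiplet designs.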

Findings

01  Achieves 2.41x to 5.74x speedup across configurations.

02  Reaches 9.56 TOPS throughput.

03  Energy efficiency of 4.98 TOPS/W.
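
To put the throughput and efficiency figures above in more familiar units, the following is a minimal sketch of the arithmetic, not something taken from the paper. The energy-per-operation conversion follows directly from the definition of TOPS/W. The implied power figure rests on an assumption that this page does not confirm: that the 9.56 TOPS and 4.98 TOPS/W figures were measured for the same configuration and workload. The variable names are illustrative.

```python
# Figures quoted in the Findings list.
throughput_tops = 9.56        # tera-operations per second
efficiency_tops_per_w = 4.98  # tera-operations per second per watt

# 1 / (TOPS/W) gives joules per tera-operation, i.e. picojoules per operation.
energy_per_op_pj = 1.0 / efficiency_tops_per_w

# ASSUMPTION: both figures describe the same configuration and workload.
# If they do not, this number is not meaningful.
implied_power_w = throughput_tops / efficiency_tops_per_w

print(f"energy per operation: {energy_per_op_pj:.3f} pJ")
print(f"implied power (same-configuration assumption): {implied_power_w:.2f} W")
```

The efficiency figure alone corresponds to roughly 0.2 pJ per operation. If the same-configuration assumption holds, the implied power draw would be roughly 1.9 W.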

Abstract

Vision Transformers (ViTs) have established new performance benchmarks in vision tasks such as image recognition and object detection. However, these advancements come with significant demands for memory and computational resources, presenting challenges for hardware deployment. Heterogeneous compute-in-memory (CIM) accelerators have emerged as a promising solution for enabling energy-efficient deployment of ViTs. Despite this potential, monolithic CIM-based designs face scalability issues due to the size limitations of a single chip. To address this challenge, emerging chiplet-based techniques offer a more scalable alternative. However, chiplet designs come with their own costs, as they introduce expensive communication, which can hinder improvements in throughput. This work introduces Hemlet, a heterogeneous CIM chiplet system designed to accelerate ViT workloads. Hemlet enables…

Taxonomy

Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Parallel Computing and Optimization Techniques