EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
Jinane Bazzi, Mariam Rakka, Fadi Kurdahi, Mohammed E. Fouda, Ahmed Eltawil

TL;DR
EdgeCIM is a hardware-software co-design that pairs a compute-in-memory (CIM) macro with a tile-based mapping strategy to improve throughput and energy efficiency for small language model inference on edge devices.
Contribution
The paper presents a CIM-based accelerator framework optimized for end-to-end decoder-only inference, together with simulator-driven design space exploration of SLMs up to 4B parameters and benchmark comparisons.
Findings
Up to 7.3x higher throughput compared to NVIDIA Orin Nano.
49.59x better energy efficiency on LLaMA3.2-1B.
Average of 336.42 tokens/sec and 173.02 tokens/J under INT4 precision.
Abstract
The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency…
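The abstract's claim that decoding is memory-bound can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the layer dimensions (2048 x 2048), the prefill length of 512 tokens, and the simplification that only weight traffic counts (activations ignored) are all assumptions chosen for illustration.

```python
def arithmetic_intensity(rows, cols, tokens, bytes_per_weight):
    """Approximate FLOPs per byte of weight traffic for a (tokens x rows) @ (rows x cols) product.

    Assumption: each weight is read from memory once and activation traffic is ignored.
    """
    flops = 2 * rows * cols * tokens            # one multiply and one add per weight per token
    weight_bytes = rows * cols * bytes_per_weight
    return flops / weight_bytes


# Hypothetical 2048 x 2048 projection layer with INT4 weights (0.5 bytes each).
decode_int4 = arithmetic_intensity(2048, 2048, tokens=1, bytes_per_weight=0.5)     # GEMV: one token per step
prefill_int4 = arithmetic_intensity(2048, 2048, tokens=512, bytes_per_weight=0.5)  # GEMM: 512 prompt tokens

print(f"decode  (GEMV): {decode_int4:.1f} FLOPs/byte")
print(f"prefill (GEMM): {prefill_int4:.1f} FLOPs/byte")
```

Under these assumptions a decode step performs only a few operations per byte of weights fetched, while prefill reuses each weight across all prompt tokens, which is consistent with the abstract's description of GEMV-dominated decoding as limited by memory bandwidth rather than compute.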
