EdgeMM: Multi-Core CPU with Heterogeneous AI-Extension and Activation-aware Weight Pruning for Multimodal LLMs at Edge

Kangbo Bai; Le Ye; Ru Huang; Tianyu Jia

arXiv:2505.10782 · cs.AR · May 19, 2025

Open Access

TL;DR

EdgeMM is a multi-core CPU architecture with heterogeneous AI extensions (systolic-array and digital compute-in-memory co-processors) and dynamic activation-aware weight pruning. On representative multimodal LLMs, it achieves a 2.84x speedup over a laptop 3060 GPU.

Contribution

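This work presents a multi-core CPU design whose AI extensions are either compute-centric systolic-array or memory-centric digital compute-in-memory (CIM) co-processors, combined with dynamic activation-aware weight pruning and bandwidth management, targeting multimodal LLMs on edge devices.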

Findings

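01. Achieves a 2.84x performance speedup over a laptop 3060 GPU on representative MLLMs

02. Improves bandwidth efficiency and core utilization through dynamic activation-aware weight pruning and bandwidth management

03. Implemented in commercial 22nm technology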

Abstract

Emerging multimodal LLMs (MLLMs) exhibit strong cross-modality perception and reasoning capabilities and hold great potential for various applications at the edge. However, MLLMs typically consist of a compute-intensive modality encoder and a memory-bound LLM decoder, leading to distinct bottlenecks for hardware designs. In this work, we present a multi-core CPU solution with heterogeneous AI extensions, which are based on either compute-centric systolic-array or memory-centric digital compute-in-memory (CIM) co-processors. In addition, dynamic activation-aware weight pruning and bandwidth management are developed to enhance bandwidth efficiency and core utilization, improving overall performance. We implemented our solution using commercial 22nm technology. For representative MLLMs, our evaluations show that EdgeMM achieves a 2.84x performance speedup compared to a laptop 3060 GPU.
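
The abstract does not spell out how activation-aware weight pruning works, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that, in a memory-bound decoder matrix-vector product, weight columns paired with small-magnitude activations can be skipped, so those weights are never fetched from memory. The function names, the top-k selection policy, and the `keep_ratio` parameter are all hypothetical.

```python
# Illustrative sketch only: NOT the EdgeMM implementation.
# Assumption: "activation-aware" pruning skips the weight columns that would be
# multiplied by the smallest-magnitude activations in a decoder GEMV.

def select_active_channels(activations, keep_ratio):
    """Return sorted indices of the largest-magnitude activations (hypothetical policy)."""
    keep = max(1, int(len(activations) * keep_ratio))
    ranked = sorted(range(len(activations)),
                    key=lambda i: abs(activations[i]), reverse=True)
    return sorted(ranked[:keep])

def dense_gemv(weights, activations):
    """Reference y = W @ x, touching every weight."""
    return [sum(w * x for w, x in zip(row, activations)) for row in weights]

def pruned_gemv(weights, activations, keep_ratio):
    """Approximate y = W @ x using only the weight columns of the kept channels.

    weights is a list of rows (out_dim x in_dim).
    Returns (y, fraction of weight columns that had to be fetched).
    """
    kept = select_active_channels(activations, keep_ratio)
    y = [sum(row[j] * activations[j] for j in kept) for row in weights]
    return y, len(kept) / len(activations)

# Toy data: two of the four activations are close to zero.
W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0]]
x = [0.01, 2.0, -0.02, 1.0]

y_dense = dense_gemv(W, x)                       # approximately [7.95, -2.035]
y_pruned, fetched = pruned_gemv(W, x, keep_ratio=0.5)
print(y_pruned, fetched)                         # [8.0, -2.0] 0.5
```

In this toy case, half of the weight columns are skipped while each output changes by at most 0.05. Because the selection depends on the runtime activations, the pruning is dynamic rather than a fixed offline mask, which is the property the abstract emphasizes.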

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Natural Language Processing Techniques