EdgeMM: Multi-Core CPU with Heterogeneous AI-Extension and Activation-aware Weight Pruning for Multimodal LLMs at Edge
Kangbo Bai, Le Ye, Ru Huang, Tianyu Jia

TL;DR
EdgeMM is a multi-core CPU architecture for running multimodal large language models (MLLMs) at the edge. It combines heterogeneous AI extensions with activation-aware weight pruning and reports a 2.84x speedup over a laptop 3060 GPU on representative MLLMs.
Contribution
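This work presents a multi-core CPU design with heterogeneous AI extensions (systolic-array and digital compute-in-memory co-processors), together with dynamic activation-aware weight pruning and bandwidth management, tailored to multimodal LLMs on edge devices.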
Findings
Achieves a 2.84x performance speedup over a laptop 3060 GPU on representative MLLMs
Dynamic activation-aware weight pruning and bandwidth management improve bandwidth efficiency and core utilization
Implemented in commercial 22nm technology
Abstract
Emerging multimodal LLMs (MLLMs) exhibit strong cross-modality perception and reasoning capabilities and hold great potential for various applications at the edge. However, MLLMs typically consist of a compute-intensive modality encoder and a memory-bound LLM decoder, leading to distinct bottlenecks for hardware designs. In this work, we present a multi-core CPU solution with heterogeneous AI extensions, which are based on either compute-centric systolic array or memory-centric digital compute-in-memory (CIM) co-processors. In addition, dynamic activation-aware weight pruning and bandwidth management are developed to enhance bandwidth efficiency and core utilization, improving overall performance. We implemented our solution using commercial 22nm technology. For representative MLLMs, our evaluations show EdgeMM can achieve a 2.84x performance speedup compared to a laptop 3060 GPU.
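The abstract does not spell out how the activation-aware weight pruning works, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that, in a memory-bound decoder matrix-vector product, weight columns paired with small-magnitude activations can be skipped so that fewer weights are fetched from memory. The function names, the top-k selection rule, and the `keep_ratio` parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple


def select_active_channels(activations: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the largest-magnitude activations, in ascending order.

    Assumption: a simple top-k rule by absolute value; the paper may use a
    different criterion.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, round(len(activations) * keep_ratio))
    ranked = sorted(range(len(activations)),
                    key=lambda i: abs(activations[i]),
                    reverse=True)
    return sorted(ranked[:keep])


def pruned_gemv(weights: Sequence[Sequence[float]],
                activations: Sequence[float],
                keep_ratio: float) -> Tuple[List[float], int]:
    """Approximate y = W @ x using only the weight columns of kept channels.

    `weights` is row-major with shape [out_features][in_features].
    Returns the output vector and the number of weight values read,
    a rough proxy for memory traffic.
    """
    kept = select_active_channels(activations, keep_ratio)
    y = [sum(row[j] * activations[j] for j in kept) for row in weights]
    weights_read = len(weights) * len(kept)
    return y, weights_read


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0, 4.0],
         [0.0, 1.0, 0.0, 1.0]]
    x = [0.01, 2.0, -0.02, -1.0]
    y, reads = pruned_gemv(W, x, keep_ratio=0.5)
    print(y, reads)  # half of the 8 weights are read
```

In this toy case the two near-zero activations are dropped, so only half of the weights are read while the output stays close to the dense result. The selection depends on the current activations, which is why such pruning has to be decided dynamically at run time rather than once offline.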
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Natural Language Processing Techniques
