Mapping Gemma3 onto an Edge Dataflow Architecture
Shouyu Du, Miaoxiang Yu, Zhenyu Xu, Zhiheng Ni, Jillian Cai, Qing Yang, Tao Wei

TL;DR
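This paper demonstrates what the authors describe as the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Its hardware-aware optimizations deliver up to 5.2x faster prefill and 4.8x faster decoding than the iGPU, with up to 67.2x better power efficiency.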
Contribution
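The paper presents hardware-aware techniques for deploying large models on tiled edge NPUs: a dequantization engine, optimized tiled matrix multiplication kernels, the FlowQKV and FlowKV attention schemes, the FusedDQP fused dequantization-and-projection kernel, and the compact Q4NX 4-bit quantization format. Together these are offered as a blueprint for future edge AI deployments.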
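To make the idea of fusing dequantization with projection concrete, here is a minimal pure-Python sketch. It is not the paper's Q4NX format or FusedDQP kernel, whose layouts are not given in this summary. It assumes a generic scheme with unsigned 4-bit codes, two codes per byte, and one scale and minimum per group. The names `quantize_q4`, `fused_dequant_dot`, and `GROUP_SIZE` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical group size; the summary does not specify Q4NX's actual layout.
GROUP_SIZE = 32


def quantize_q4(
    weights: List[float], group_size: int = GROUP_SIZE
) -> Tuple[bytes, List[float], List[float]]:
    """Quantize weights to 4-bit codes with one (scale, minimum) pair per group."""
    assert len(weights) % group_size == 0 and group_size % 2 == 0
    packed = bytearray()
    scales: List[float] = []
    mins: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15.0 or 1.0  # fall back to 1.0 for a constant group
        codes = [min(15, max(0, round((w - lo) / scale))) for w in group]
        # Pack two 4-bit codes per byte, low nibble first.
        for i in range(0, group_size, 2):
            packed.append(codes[i] | (codes[i + 1] << 4))
        scales.append(scale)
        mins.append(lo)
    return bytes(packed), scales, mins


def fused_dequant_dot(
    packed: bytes,
    scales: List[float],
    mins: List[float],
    x: List[float],
    group_size: int = GROUP_SIZE,
) -> float:
    """Dot product of one quantized weight row with x, without materializing
    the dequantized weights.

    Uses sum((lo + scale * c_i) * x_i) = scale * sum(c_i * x_i) + lo * sum(x_i),
    so the scale and minimum are applied once per group.
    """
    acc = 0.0
    for g, (scale, lo) in enumerate(zip(scales, mins)):
        base = g * group_size
        code_dot = 0.0
        x_sum = 0.0
        for i in range(0, group_size, 2):
            byte = packed[(base + i) // 2]
            code_dot += (byte & 0x0F) * x[base + i] + (byte >> 4) * x[base + i + 1]
            x_sum += x[base + i] + x[base + i + 1]
        acc += scale * code_dot + lo * x_sum
    return acc
```

Decoding is typically limited by how fast weights can be streamed from memory, so reading 4-bit codes and applying the scale inside the dot product avoids writing a full-precision copy of the weights back out. That is the general motivation for this kind of fusion. How the paper maps it onto NPU tiles is not described here.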
Findings
Up to 5.2x faster prefill and 4.8x faster decoding compared to the iGPU
Power efficiency improved by up to 67.2x over the iGPU and 222.9x over the CPU
Demonstrates practical low-power inference of large models at the edge
Abstract
We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Our work introduces a set of hardware-aware techniques. For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism. For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which re-structures attention to sustain high memory bandwidth utilization. Together with a compact Q4NX 4-bit quantization format, these methods yield up to 5.2x faster prefill and 4.8x faster decoding versus the iGPU, and [...] and [...] over the CPU, respectively. Power efficiency improves by as much as 67.2x and 222.9x compared to the iGPU and CPU, respectively. The…
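The abstract describes FlowQKV only as a chunked, pipelined attention mechanism. As a rough illustration of the chunking part alone, the sketch below computes single-query softmax attention over the keys and values one chunk at a time, using a running maximum and normalizer (the standard online-softmax trick). It does not model pipelining, tiling, or any NPU specifics. The function names and the `chunk` parameter are hypothetical and do not come from the paper.

```python
import math
from typing import List

Vector = List[float]


def chunked_attention(
    q: Vector, keys: List[Vector], values: List[Vector], chunk: int = 4
) -> Vector:
    """Softmax attention for one query, visiting K/V in fixed-size chunks."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        k_chunk = keys[start:start + chunk]
        v_chunk = values[start:start + chunk]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first chunk).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def reference_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Unchunked softmax attention, used only to check the chunked version."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
```

Because each chunk only needs the running maximum, normalizer, and accumulator from the previous chunks, the full score matrix never has to be held at once. This is what makes chunked attention a natural fit for devices with small on-chip memories.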
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
