Hummingbird: A Smaller and Faster Large Language Model Accelerator on Embedded FPGA
Jindong Li, Tenglong Li, Ruiqi Chen, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng

TL;DR
Hummingbird is a compact, energy-efficient FPGA accelerator tailored for large language model inference on embedded devices, achieving high throughput and supporting larger models than previous solutions.
Contribution
It introduces an FPGA-based LLM accelerator that uses fewer hardware resources and less power than prior designs, while overcoming memory constraints to run larger models on embedded FPGAs.
Findings
Achieves 4.8 tokens/s on the KV260 and 8.6 tokens/s on the ZCU104 when running LLaMA3-8B.
Saves 67% of LUTs, 39% of DSPs, and 42% of power relative to existing solutions.
Supports longer contexts and larger models on embedded FPGAs.
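To put the throughput figures in more intuitive terms, the short sketch below converts tokens per second into average latency per token and into wall-clock time for a reply. It assumes the quoted numbers are steady decode rates and ignores prompt prefill time; the 256-token reply length and the helper names are hypothetical, chosen only for illustration.

```python
# Throughput figures quoted in the findings (tokens per second, LLaMA3-8B).
THROUGHPUT_TOK_PER_S = {"KV260": 4.8, "ZCU104": 8.6}

# Hypothetical reply length used only for illustration (not from the paper).
REPLY_TOKENS = 256


def ms_per_token(tokens_per_second: float) -> float:
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a constant decode rate, in seconds.

    Assumption: the rate is constant and prefill time is ignored.
    """
    return num_tokens / tokens_per_second


for board, tps in THROUGHPUT_TOK_PER_S.items():
    print(
        f"{board}: {ms_per_token(tps):.1f} ms/token, "
        f"{generation_time_s(REPLY_TOKENS, tps):.1f} s for {REPLY_TOKENS} tokens"
    )
```

Under these assumptions, the ZCU104 figure is roughly 1.8 times the KV260 figure (8.6 / 4.8), so a reply of the same length would take a little over half as long.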
Abstract
Deploying large language models (LLMs) on embedded devices remains a significant research challenge due to the high computational and memory demands of LLMs and the limited hardware resources available in such environments. While embedded FPGAs have demonstrated performance and energy efficiency in traditional deep neural networks, their potential for LLM inference remains largely unexplored. Recent efforts to deploy LLMs on FPGAs have primarily relied on large, expensive cloud-grade hardware and have only shown promising results on relatively small LLMs, limiting their real-world applicability. In this work, we present Hummingbird, a novel FPGA accelerator designed specifically for LLM inference on embedded FPGAs. Hummingbird is smaller, targeting embedded FPGAs such as the KV260 and ZCU104 with 67% LUT, 39% DSP, and 42% power savings over existing research. Hummingbird is stronger,…
Taxonomy
Topics: Parallel Computing and Optimization Techniques
