Hummingbird: A Smaller and Faster Large Language Model Accelerator on Embedded FPGA

Jindong Li; Tenglong Li; Ruiqi Chen; Guobin Shen; Dongcheng Zhao; Qian Zhang; Yi Zeng

arXiv:2507.03308 · cs.AR · October 20, 2025

TL;DR

Hummingbird is a compact, energy-efficient FPGA accelerator tailored for large language model inference on embedded devices, achieving high throughput and supporting larger models than previous solutions.

Contribution

The paper introduces Hummingbird, a smaller and faster FPGA-based LLM accelerator that works around the memory constraints of embedded FPGAs and improves inference performance for embedded applications.

Findings

01. Achieves 4.8 tokens/s on the KV260 and 8.6 tokens/s on the ZCU104 for LLaMA3-8B.

02. Saves 67% of LUTs, 39% of DSPs, and 42% of power relative to existing solutions.

03. Supports longer contexts and larger models on embedded FPGAs.
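
To put the reported numbers in more familiar terms, the short sketch below converts throughput into per-token latency and converts the savings figures into the share of the baseline that remains. It assumes the percentages are savings relative to prior work, as the abstract states; the variable names are illustrative and nothing here comes from the paper's own code.

```python
# Figures as reported on this page; everything else is derived arithmetic.
tokens_per_sec = {"KV260": 4.8, "ZCU104": 8.6}      # LLaMA3-8B throughput
savings_pct = {"LUT": 67, "DSP": 39, "power": 42}   # savings vs. existing research (per the abstract)

# Per-token latency in milliseconds is the reciprocal of throughput.
latency_ms = {board: 1000.0 / tps for board, tps in tokens_per_sec.items()}

# A saving of s% means the design uses (100 - s)% of the baseline.
remaining_pct = {name: 100 - s for name, s in savings_pct.items()}

for board, ms in latency_ms.items():
    print(f"{board}: about {ms:.0f} ms per token")
for name, pct in remaining_pct.items():
    print(f"{name}: {pct}% of the baseline")
```

Under that reading, the KV260 result corresponds to roughly 208 ms per generated token and the ZCU104 result to roughly 116 ms.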

Abstract

Deploying large language models (LLMs) on embedded devices remains a significant research challenge due to the high computational and memory demands of LLMs and the limited hardware resources available in such environments. While embedded FPGAs have demonstrated performance and energy efficiency in traditional deep neural networks, their potential for LLM inference remains largely unexplored. Recent efforts to deploy LLMs on FPGAs have primarily relied on large, expensive cloud-grade hardware and have only shown promising results on relatively small LLMs, limiting their real-world applicability. In this work, we present Hummingbird, a novel FPGA accelerator designed specifically for LLM inference on embedded FPGAs. Hummingbird is smaller, targeting embedded FPGAs such as the KV260 and ZCU104 with 67% LUT, 39% DSP, and 42% power savings over existing research. Hummingbird is stronger,…

Taxonomy

Topics: Parallel Computing and Optimization Techniques