TerEffic: Highly Efficient Ternary LLM Inference on FPGA
Chenyang Yin, Zhenyu Bai, Pranav Venkatram, Shivam Aggarwal, Zhaoying Li, and Tulika Mitra

TL;DR
TerEffic introduces a specialized FPGA architecture for ternary-quantized LLM inference that achieves high throughput and energy efficiency, outperforming the NVIDIA Jetson Orin Nano in throughput and the NVIDIA A100 in power efficiency.
Contribution
The paper presents a novel FPGA-based system optimized for ternary-quantized LLM inference, enabling flexible, high-performance, and energy-efficient deployment on edge hardware.
Findings
Achieves 16,300 tokens/sec on a 370M-parameter model
Provides 192x throughput improvement over NVIDIA Jetson Orin Nano
Attains 8x power efficiency gain over NVIDIA A100
Abstract
Deploying Large Language Models (LLMs) efficiently on edge devices is often constrained by limited memory capacity and high power consumption. Low-bit quantization methods, particularly ternary quantization, have demonstrated significant potential in preserving model accuracy while substantially decreasing memory footprint and computational costs. However, existing general-purpose architectures and accelerators have not fully exploited the advantages of low-bit quantization due to insufficient specialized hardware support. We introduce TerEffic, an FPGA-based architecture tailored for ternary-quantized LLM inference. The proposed system offers flexibility through reconfigurable hardware to meet various system requirements. We evaluated two representative configurations: a fully on-chip design that stores all weights within on-chip memories, scaling out using multiple FPGAs, and an…
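To make the abstract's claim about memory footprint and computational cost concrete, here is a minimal Python sketch of ternary quantization. It is an illustration under stated assumptions, not the paper's method: the absolute-mean scaling rule and the helper names `ternarize`, `ternary_dot`, and `ternary_footprint_mb` are chosen here for the example. It shows that once weights are restricted to {-1, 0, +1}, a dot product needs only additions and subtractions, and that each weight carries at most log2(3) ≈ 1.58 bits of information.

```python
import math

def ternarize(weights):
    """Quantize floats to {-1, 0, +1} with one shared scale.

    Assumption: the scale is the mean absolute weight; other rules exist.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def ternary_dot(q, x):
    """Dot product with ternary weights: adds and subtracts only."""
    acc = 0.0
    for qi, xi in zip(q, x):
        if qi == 1:
            acc += xi
        elif qi == -1:
            acc -= xi
        # qi == 0 contributes nothing, so the input is skipped
    return acc

def ternary_footprint_mb(num_params):
    """Information-theoretic lower bound on weight storage, in megabytes."""
    return num_params * math.log2(3) / 8 / 1e6

q, scale = ternarize([0.9, -0.05, -1.2, 0.4])
print(q)                                   # ternary codes
print(scale * ternary_dot(q, [2.0, 3.0, 4.0, 5.0]))  # rescaled result
print(round(ternary_footprint_mb(370e6), 1))         # MB for 370M weights
```

By this bound, the weights of a 370M-parameter model occupy roughly 73 MB, which gives a sense of why keeping all weights in on-chip memory becomes plausible. The storage an actual design needs depends on its packing scheme and on any tensors kept at higher precision.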
Taxonomy
Topics: Neural Networks and Applications
