TerEffic: Highly Efficient Ternary LLM Inference on FPGA

Chenyang Yin; Zhenyu Bai; Pranav Venkatram; Shivam Aggarwal; Zhaoying Li; and Tulika Mitra

arXiv:2502.16473 · cs.AR · May 2, 2025

Open Access

TL;DR

TerEffic introduces a specialized FPGA architecture for ternary-quantized LLM inference on edge devices. Reported results include 16,300 tokens/sec on a 370M-parameter model, a 192x throughput improvement over the NVIDIA Jetson Orin Nano, and an 8x power-efficiency gain over the NVIDIA A100.

Contribution

The paper presents an FPGA-based system optimized for ternary-quantized LLM inference. Its reconfigurable hardware allows flexible, high-performance, and energy-efficient deployment on edge hardware.

Findings

01. Achieves 16,300 tokens/sec on a 370M-parameter model

02. Provides a 192x throughput improvement over the NVIDIA Jetson Orin Nano

03. Attains an 8x power-efficiency gain over the NVIDIA A100

Abstract

Deploying Large Language Models (LLMs) efficiently on edge devices is often constrained by limited memory capacity and high power consumption. Low-bit quantization methods, particularly ternary quantization, have demonstrated significant potential in preserving model accuracy while substantially decreasing memory footprint and computational costs. However, existing general-purpose architectures and accelerators have not fully exploited the advantages of low-bit quantization due to insufficient specialized hardware support. We introduce TerEffic, an FPGA-based architecture tailored for ternary-quantized LLM inference. The proposed system offers flexibility through reconfigurable hardware to meet various system requirements. We evaluated two representative configurations: a fully on-chip design that stores all weights within on-chip memories, scaling out using multiple FPGAs, and an…
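To make the abstract's point about memory footprint and computational cost concrete, here is a minimal Python sketch of ternary weights in general, not of the TerEffic hardware. It assumes an absmean-style quantizer (scale by the mean absolute weight, round, clamp to {-1, 0, +1}) and a 2-bit-per-weight packing; neither choice is confirmed by the text above, and the function names `ternarize`, `ternary_dot`, and `weight_megabytes` are hypothetical.

```python
from typing import List, Tuple


def ternarize(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to {-1, 0, +1} plus one shared scale.

    Assumption: absmean-style scaling; the paper's quantizer may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def ternary_dot(ternary: List[int], scale: float, activations: List[float]) -> float:
    """Dot product with ternary weights: only adds, subtracts, and skips."""
    acc = 0.0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a      # weight +1: add the activation
        elif t == -1:
            acc -= a      # weight -1: subtract the activation
        # weight 0: contributes nothing, so no work is done
    return acc * scale    # a single multiply restores the scale


def weight_megabytes(num_params: int, bits_per_weight: float) -> float:
    """Weight storage in MB (10^6 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    t, s = ternarize([0.9, -1.1, 0.05, 0.4])
    print(t, s)
    print(ternary_dot(t, s, [2.0, 1.0, 5.0, 4.0]))
    # Assumed 2-bit packing versus FP16 for a 370M-parameter model.
    print(weight_megabytes(370_000_000, 2), weight_megabytes(370_000_000, 16))
```

Under these assumptions, a 370M-parameter model needs about 92.5 MB of weight storage at 2 bits per weight versus 740 MB in FP16, which suggests why keeping all weights on chip becomes plausible once they are ternary. The inner loop also contains no per-weight multiplications, which is the property specialized hardware can exploit.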

Taxonomy

Topics: Neural Networks and Applications