Harmonia: Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference

Xinyu Wang; Jieyu Li; Yanan Sun; Weifeng He

arXiv:2602.04595 · cs.AR · February 24, 2026

TL;DR

Harmonia is a co-designed algorithm-hardware framework that enables efficient, all-layer block floating point activations in large language models, significantly reducing memory and computation costs while maintaining accuracy.

Contribution

Harmonia introduces a co-design approach that extends BFP activations to all layers, including attention, with hardware support for mixed formats and aggressive KV-cache compression.

Findings

01. Achieves up to 5.05x higher area efficiency

02. Improves energy efficiency by up to 3.90x

03. Provides up to 4.62x speedup on LLM inference

Abstract

Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy…
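To make the term concrete, the following is a minimal sketch of generic block floating point quantization, in which every value in a block shares one exponent and keeps only a short signed mantissa. It is not Harmonia's format: the function names `bfp_quantize` and `bfp_dequantize`, the 4-bit mantissa (sign included), and the example block are all assumptions chosen for illustration.

```python
import math

def bfp_quantize(block, mantissa_bits=4):
    """Quantize floats to one shared exponent plus signed integer mantissas.

    Assumption: mantissa_bits includes the sign bit, so magnitudes
    are limited to 2**(mantissa_bits - 1) - 1.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e is the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    # One mantissa step corresponds to this many real-valued units.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest step and clip to the representable range.
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=4):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

# Hypothetical activation block; small values lose precision because
# the exponent is set by the largest magnitude in the block.
block = [0.5, -0.25, 0.125, 1.0]
exp, mants = bfp_quantize(block)
restored = bfp_dequantize(exp, mants)
```

In this sketch the smallest element (0.125) rounds to zero because the block's exponent is dictated by its largest element. That sensitivity to outliers within a block is one general reason low-bit BFP can hurt accuracy.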

Taxonomy

Topics: Big Data and Digital Economy · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques