Harmonia: Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
Xinyu Wang, Jieyu Li, Yanan Sun, Weifeng He

TL;DR
Harmonia is a co-designed algorithm-hardware framework that enables efficient, all-layer block floating point activations in large language models, significantly reducing memory and computation costs while maintaining accuracy.
Contribution
It introduces a novel co-design approach that extends BFP to all layers, including attention, with hardware support for mixed formats and aggressive cache compression.
Findings
Achieves up to 5.05x higher area efficiency
Improves energy efficiency by up to 3.90x
Provides up to 4.62x speedup on LLM inference
Abstract
Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy…
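For readers unfamiliar with block floating point, the following is a minimal sketch of generic BFP quantization, not Harmonia's actual scheme. Each block of values shares a single exponent and every value keeps only a short signed mantissa. The function name `bfp_quantize` and the parameters `block_size` and `mantissa_bits` are illustrative assumptions; the paper's block sizes, bit widths, and rounding rules are not given in this listing.

```python
import math


def bfp_quantize(values, block_size=4, mantissa_bits=4):
    """Round-trip a list of floats through a generic BFP format.

    Hypothetical helper for illustration only: returns the dequantized
    values so the rounding error introduced by BFP can be inspected.
    """
    out = []
    # Largest magnitude a signed mantissa of this width can hold.
    qmax = 2 ** (mantissa_bits - 1) - 1
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        if max_abs == 0.0:
            out.extend(0.0 for _ in block)
            continue
        # Shared exponent: frexp gives e such that max_abs < 2**e.
        shared_exp = math.frexp(max_abs)[1]
        # One mantissa step, chosen so the block maximum fits the mantissa range.
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        for v in block:
            m = round(v / scale)
            m = max(-qmax, min(qmax, m))  # clamp to the signed mantissa range
            out.append(m * scale)
    return out


print(bfp_quantize([1.0, 0.5, 0.25, 0.1]))  # [1.0, 0.5, 0.25, 0.0]
```

In this example the smallest element is rounded to zero because the step size is set by the largest element in the block. This sensitivity to outliers within a block is one plausible reason why naive BFP can hurt accuracy, although the listing does not state the specific cause in attention layers.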
Taxonomy
Topics: Big Data and Digital Economy · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
