APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design

Yonghao Tan; Pingcheng Dong; Yongkun Wu; Yu Liu; Xuejiao Liu; Peng Luo; Shih-Yang Liu; Xijie Huang; Dong Zhang; Luhong Liang; Kwang-Ting Cheng

arXiv:2505.03748 · cs.AR · May 8, 2025

Open Access

TL;DR

This paper introduces APSQ, a novel quantization method that reduces energy consumption in DNN accelerators by efficiently compressing partial sums, with demonstrated benefits on NLP, CV, and large language models.

Contribution

APSQ integrates PSUM quantization into the model compression framework, enabling nearly lossless performance and significant energy savings across various neural network architectures.

Findings

1. Achieves nearly lossless quantization on NLP and CV tasks.
2. Reduces energy costs by 28-87% in DNN accelerators.
3. Demonstrates effectiveness on large language models like LLaMA2-7B.

Abstract

Model compression and specialized dataflow techniques have significantly advanced DNN accelerators. However, frequent access to high-precision partial sums (PSUMs) leads to excessive memory demands in architectures that use input-stationary or weight-stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, even though PSUMs may account for 69% of power consumption. This study introduces Additive Partial Sum Quantization (APSQ), a method that integrates PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization, supported by a reconfigurable architecture, is further proposed. APSQ achieves nearly lossless performance on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8, which reduces energy costs by 28-87%.…
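To make the idea of folding PSUM accumulation into quantization concrete, the following is a minimal sketch and not the paper's implementation. It assumes a symmetric uniform quantizer with a single fixed scale, and the names `quantize` and `dot_with_psum_quant` are hypothetical. The dot product is split into tiles along the reduction dimension, and the running PSUM is re-quantized to an INT8 code after each tile, so only the low-precision code would need to be written back to memory between tiles.

```python
def quantize(x, scale, bits=8):
    """Symmetric uniform quantization to a signed integer code (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    q = round(x / scale)
    # Saturate to the representable range, e.g. [-128, 127] for INT8.
    return max(-qmax - 1, min(qmax, q))


def dot_with_psum_quant(acts, weights, tile, scale, bits=8):
    """Tiled dot product whose running partial sum is stored as a quantized code.

    After each tile, the stored code is dequantized, the new tile's
    contribution is added, and the result is quantized again.
    """
    psum_code = 0
    for start in range(0, len(acts), tile):
        a_tile = acts[start:start + tile]
        w_tile = weights[start:start + tile]
        partial = sum(a * w for a, w in zip(a_tile, w_tile))
        psum_code = quantize(psum_code * scale + partial, scale, bits)
    # Return the dequantized final sum.
    return psum_code * scale


acts = [3, -2, 5, 7, -1, 4, 2, -6]
weights = [2, 4, -1, 3, 5, -2, 1, 1]
exact = sum(a * w for a, w in zip(acts, weights))
approx = dot_with_psum_quant(acts, weights, tile=4, scale=0.5)
print(exact, approx)
```

In this sketch, each re-quantization adds at most half a quantization step of rounding error as long as the PSUM does not saturate, so the choice of scale trades dynamic range against accumulated error. How APSQ actually chooses scales and groups PSUMs is described in the paper itself.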

Taxonomy

Topics: Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Attention Dropout · Softmax · Residual Connection · WordPiece · Linear Layer