PacQ: A SIMT Microarchitecture for Efficient Dataflow in Hyper-asymmetric GEMMs
Ruokai Yin, Yuhang Li, Priyadarshini Panda

TL;DR
PacQ is a specialized SIMT microarchitecture that accelerates hyper-asymmetric GEMMs involving low-precision INT weights and high-precision FP activations, significantly improving performance and energy efficiency.
Contribution
The paper introduces a novel microarchitecture, PacQ, with co-optimized dataflow, packing strategies, and a dedicated multiplier unit for efficient hyper-asymmetric GEMMs.
Findings
Achieves up to 1.99x speedup over conventional baselines.
Reduces energy-delay product (EDP) by 81.4%.
Effectively accelerates weight-only quantized LLM inference.
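For readers unfamiliar with the metric, energy-delay product is energy multiplied by execution time, so an EDP reduction combines a speedup with an energy saving. The sketch below shows only that arithmetic. It assumes, purely for illustration, that the two headline figures describe the same workload and baseline, which this summary does not state; the function names are hypothetical.

```python
def edp(energy: float, delay: float) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return energy * delay


def implied_energy_ratio(edp_reduction: float, speedup: float) -> float:
    """Energy relative to the baseline implied by an EDP reduction and a speedup.

    ASSUMPTION: both figures refer to the same workload and baseline.
    """
    edp_ratio = 1.0 - edp_reduction   # remaining EDP as a fraction of baseline
    delay_ratio = 1.0 / speedup       # remaining delay as a fraction of baseline
    return edp_ratio / delay_ratio    # EDP = energy * delay, so divide out the delay


# Illustrative only: plugs in the two headline numbers under the assumption above.
ratio = implied_energy_ratio(edp_reduction=0.814, speedup=1.99)
print(f"implied energy relative to baseline: {ratio:.3f}")
```

If the two numbers come from different configurations, the implied energy ratio does not apply; the definition of EDP itself still does.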
Abstract
Weight-only quantization has been widely explored in large language models (LLMs) to reduce memory storage and data loading overhead. During deployment on single-instruction-multiple-threads (SIMT) architectures, weights are stored in low-precision integer (INT) format, while activations remain in full-precision floating-point (FP) format to preserve inference accuracy. Although memory footprint and data loading requirements for weight matrices are reduced, computation performance gains remain limited due to the need to convert weights back to FP format through unpacking and dequantization before GEMM operations. In this work, we investigate methods to accelerate GEMM operations involving packed low-precision INT weights and high-precision FP activations, defining this as the hyper-asymmetric GEMM problem. Our approach co-optimizes tile-level packing and dataflow strategies for INT…
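The conventional path the abstract describes (unpack the integer weights, dequantize them to floating point, then run the GEMM) can be sketched in a few lines. This is a minimal illustration of that baseline, not of PacQ's dataflow or multiplier. The 4-bit width, low-nibble-first packing order, and single per-tensor scale and zero point are assumptions chosen for brevity, and all function names are hypothetical.

```python
from typing import List


def pack_int4(values: List[int]) -> bytes:
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first (assumed layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: split each byte into its low and high nibble."""
    out: List[int] = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to floats using one per-tensor scale and zero point (assumed)."""
    return [scale * (v - zero_point) for v in q]


def gemm_baseline(acts: List[List[float]], packed_w: bytes, k: int, n: int,
                  scale: float, zero_point: int) -> List[List[float]]:
    """Baseline: unpack and dequantize the K x N row-major weights, then do an FP GEMM."""
    w = dequantize(unpack_int4(packed_w), scale, zero_point)
    return [[sum(row[kk] * w[kk * n + j] for kk in range(k)) for j in range(n)]
            for row in acts]


# A 2 x 2 weight matrix stored as four 4-bit codes in two bytes.
packed = pack_int4([9, 10, 11, 12])
result = gemm_baseline([[1.0, 2.0]], packed, k=2, n=2, scale=0.5, zero_point=8)
print(result)
```

The unpack and dequantize steps run before any multiply-accumulate work, which is the overhead the abstract identifies as limiting the computational gains of weight-only quantization.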
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Big Data and Digital Economy
