Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

Jihao Xin; Tian Lyu; Qilong Pan; Kesen Wang; Marco Canini

arXiv:2604.09595 · cs.DC · April 14, 2026

TL;DR

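Post-training compression of large language models can leave tensors with irregular dimensions that the GPU execution stack handles poorly, so a model with fewer parameters may run no faster than the original. The paper traces this dimensional misalignment through the framework, library, and hardware levels and proposes GAC (GPU-Aligned Compression), which re-selects hardware-aligned dimensions under the same parameter budget to recover inference speed.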

Contribution

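The paper pairs a full-stack analysis of dimensional misalignment with GAC, a compression paradigm that wraps any dimension-reducing compressor and re-selects hardware-aligned dimensions via multi-choice knapsack optimization under the same parameter budget, improving GPU inference speed while keeping model quality comparable.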
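
To make "hardware-aligned dimensions" concrete, here is a minimal Python sketch that checks alignment and measures how many dimensions of a model are misaligned. It is illustrative only: the alignment multiple of 8 is an assumption (this summary does not state which multiples the paper targets), and the helper names `is_aligned`, `nearest_aligned`, and `misaligned_fraction` are hypothetical rather than part of GAC.

```python
from typing import Iterable


def is_aligned(dim: int, multiple: int = 8) -> bool:
    """True if `dim` is a multiple of `multiple` (8 is an assumed value)."""
    return dim % multiple == 0


def nearest_aligned(dim: int, multiple: int = 8) -> int:
    """Closest positive multiple of `multiple` to `dim`; ties round down."""
    lower = (dim // multiple) * multiple
    upper = lower + multiple
    if lower == 0:
        # Never return a zero-sized dimension.
        return upper
    return lower if dim - lower <= upper - dim else upper


def misaligned_fraction(dims: Iterable[int], multiple: int = 8) -> float:
    """Share of dimensions that are not multiples of `multiple`."""
    dims = list(dims)
    if not dims:
        return 0.0
    return sum(not is_aligned(d, multiple) for d in dims) / len(dims)


if __name__ == "__main__":
    # Hypothetical ranks, not taken from the paper.
    ranks = [1024, 1021, 512, 333]
    print(misaligned_fraction(ranks))
    print([nearest_aligned(r) for r in ranks])
```

Rounding each dimension independently, as `nearest_aligned` does, can change the total parameter count, which is presumably why the paper frames the re-selection as a budgeted optimization rather than simple rounding.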

Findings

01. GAC achieves 100% alignment in compressed models.

02. GAC recovers up to 1.5× speedup on Llama-3-8B.

03. Compressed models with GAC maintain comparable quality.

Abstract

Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance, a phenomenon we call dimensional misalignment. We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that model inference becomes slower because the resulting dimensions are unfriendly to the GPU execution stack. For example, Llama-3-8B compressed with activation-aware singular value decomposition (ASVD) has 15% fewer parameters yet runs no faster than the uncompressed baseline, because 95% of its dimensions are misaligned. We propose GAC (GPU-Aligned Compression), a new compression paradigm that wraps any dimension-reducing compressor and re-selects hardware-aligned dimensions via multi-choice knapsack optimization under the same parameter budget. We evaluate GAC…
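
The abstract frames dimension re-selection as a multi-choice knapsack problem: pick exactly one candidate dimension per layer so that total cost stays within the parameter budget. The sketch below is a generic textbook dynamic program for that problem class, not the paper's solver. The function name `solve_mckp`, the integer cost units, and the per-candidate values are hypothetical, since this summary does not say how GAC scores candidates.

```python
from typing import List, Optional, Tuple

Option = Tuple[int, int]  # (parameter cost, value), both hypothetical integers


def solve_mckp(groups: List[List[Option]], budget: int) -> Optional[Tuple[int, List[int]]]:
    """Pick exactly one option per group, maximizing value with total cost <= budget.

    Returns (best_value, chosen_index_per_group), or None if no selection fits.
    """
    neg = float("-inf")
    # dp[b] = best value over the groups seen so far with total cost exactly b.
    dp = [0] + [neg] * budget
    back: List[List[int]] = []
    for options in groups:
        new = [neg] * (budget + 1)
        choice = [-1] * (budget + 1)
        for b in range(budget + 1):
            if dp[b] == neg:
                continue
            for i, (cost, value) in enumerate(options):
                nb = b + cost
                if nb <= budget and dp[b] + value > new[nb]:
                    new[nb] = dp[b] + value
                    choice[nb] = i
        dp = new
        back.append(choice)

    end = max(range(budget + 1), key=lambda b: dp[b])
    if dp[end] == neg:
        return None

    # Walk the back-pointers from the last group to the first.
    picks: List[int] = []
    b = end
    for options, choice in zip(reversed(groups), reversed(back)):
        i = choice[b]
        picks.append(i)
        b -= options[i][0]
    picks.reverse()
    return dp[end], picks


if __name__ == "__main__":
    # Two layers, three aligned candidates each; all numbers are made up.
    layers = [
        [(2, 10), (3, 18), (4, 20)],
        [(2, 5), (3, 20), (4, 22)],
    ]
    print(solve_mckp(layers, budget=6))
```

In this toy instance, taking the largest candidate for either layer leaves too little budget for the other, so the best selection is the middle candidate in both groups. The table has `budget + 1` entries per group, so real parameter counts would need to be expressed in coarse units for the approach to stay tractable.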
