BitSkip: An Empirical Analysis of Quantization and Early Exit Composition in Transformers
Ramshankar Bhuvaneswaran, Handan Liu

TL;DR
BitSkip systematically explores quantization and early exit strategies in transformers, revealing that simple 8-bit quantization without complex transforms can outperform more intricate methods and even rival full-precision models in language modeling tasks.
Contribution
Introduces BitSkip, a hybrid framework for analyzing the interactions of quantization and early exit techniques in transformers, highlighting the surprising effectiveness of simple 8-bit quantization.
Findings
8-bit quantized model without Hadamard outperforms 4-bit and Hadamard-enhanced models.
Hadamard transforms at 8-bit cause severe training instability, degrading performance by over 37,000%.
Exiting early at layer 18 offers a 32.5% speed gain with only 4% quality loss.
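To make the two ingredients in these findings concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantization, with and without a Hadamard rotation applied first. It is an illustration under assumptions, not the paper's implementation: the absmax per-tensor quantizer, the function names, and the toy activation vector are all hypothetical, and this summary does not say where in the network BitSkip applies the transform.

```python
import math

def quantize_absmax_int8(values):
    """Symmetric per-tensor 8-bit quantization (assumed scheme, not from the paper)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the signed 8-bit range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

def hadamard_transform(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def max_abs_error(xs, ys):
    """Largest elementwise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(xs, ys))

# Hypothetical activation vector with a single large outlier.
activations = [0.02, -0.01, 0.03, 8.0, -0.02, 0.01, 0.00, -0.03]

# Variant 1: plain 8-bit quantization.
codes, scale = quantize_absmax_int8(activations)
plain = dequantize(codes, scale)

# Variant 2: rotate, quantize, dequantize, rotate back.
# The orthonormal Hadamard transform is its own inverse.
rotated = hadamard_transform(activations)
r_codes, r_scale = quantize_absmax_int8(rotated)
restored = hadamard_transform(dequantize(r_codes, r_scale))

print("plain error:   ", max_abs_error(activations, plain))
print("hadamard error:", max_abs_error(activations, restored))
```

The sketch only measures round-trip reconstruction error on a single vector. It says nothing about training stability, which is where the findings above locate the problem with the Hadamard variants.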
Abstract
The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While the individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes with the full-precision baseline in quality (perplexity of 1.13 vs 1.19). The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, a failure traced to fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with…
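The early-exit idea mentioned in the abstract can be illustrated with a short sketch: run only the first k layers of the stack and return the intermediate hidden state. Everything here is an assumption for illustration. The toy layers, the 24-layer depth, and the function name are hypothetical, and a real model would pass the exit hidden state through a normalization layer and a language-model head, which this sketch leaves out.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]

def forward_with_early_exit(
    hidden: List[float],
    layers: Sequence[Layer],
    exit_layer: Optional[int] = None,
) -> Tuple[List[float], int]:
    """Run layers 1..exit_layer (1-indexed); None runs the full stack.

    Returns the final hidden state and the number of layers executed.
    """
    n = len(layers) if exit_layer is None else exit_layer
    if not 1 <= n <= len(layers):
        raise ValueError("exit_layer must be between 1 and len(layers)")
    for layer in layers[:n]:
        hidden = layer(hidden)
    return hidden, n

def make_toy_layer() -> Layer:
    """Stand-in for a transformer block: adds 1.0 to every element."""
    return lambda h: [v + 1.0 for v in h]

NUM_LAYERS = 24  # hypothetical depth; the summary does not state the model's layer count
layers = [make_toy_layer() for _ in range(NUM_LAYERS)]

full_out, full_n = forward_with_early_exit([0.0, 0.0], layers)
early_out, early_n = forward_with_early_exit([0.0, 0.0], layers, exit_layer=18)

print("layers run (full): ", full_n)
print("layers run (early):", early_n)
```

Skipping the later layers saves their compute, and the price is whatever refinement those layers would have added to the hidden state. That is the speed-versus-quality trade-off the reported layer-18 numbers quantify.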
