TurboESM: Ultra-Efficient 3-Bit KV Cache Quantization for Protein Language Models with Orthogonal Rotation and QJL Correction
Yue Hu, Junqing Wang, Yingchao Liu

TL;DR
TurboESM is a 3-bit quantization method for the key-value (KV) cache of protein language models. It reports a 7.1x memory reduction on ESM-2 650M with cosine similarity above 0.96, making single-GPU inference more practical.
Contribution
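The paper adapts TurboQuant to protein language models (PLMs). It resolves the incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations with a RoPE-first rotation pipeline, and adds head-wise SVD calibration, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit QJL residual correction to make 3-bit quantization workable.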
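The interaction between RoPE and an orthogonal rotation can be illustrated with a small NumPy sketch. It is not the authors' implementation: the rotation here is a random orthogonal matrix from a QR decomposition (the paper calibrates per head with SVD), the quantizer is a plain min-max uniform 3-bit quantizer (the paper uses look-up tables), the outlier channel is synthetic, and the function names `rope`, `random_orthogonal`, and `quantize_3bit` are hypothetical. Under those assumptions, it shows that rotating after RoPE leaves attention scores unchanged, that rotating before RoPE generally does not, and that the rotation spreads an outlier channel so that 3-bit quantization loses less.

```python
import numpy as np

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix gives an orthogonal factor.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def rope(x, pos):
    # x: (n, d) vectors, pos: (n,) positions.
    # Rotates each pair (x[2i], x[2i+1]) by the angle pos * theta_i.
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)
    ang = pos[:, None] * theta[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def quantize_3bit(x):
    # Per-vector min-max uniform quantizer with 8 levels (illustrative only).
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 7.0
    codes = np.round((x - lo) / scale)  # integer codes in 0..7
    return codes * scale + lo

rng = np.random.default_rng(0)
n, d = 256, 64
pos = np.arange(n)
R = random_orthogonal(d, rng)

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
K[:, 3] = 25.0  # synthetic outlier channel

Q_rope, K_rope = rope(Q, pos), rope(K, pos)
scores = Q_rope @ K_rope.T

# RoPE first, then rotate: R cancels in the dot product, scores are unchanged.
scores_rot = (Q_rope @ R.T) @ (K_rope @ R.T).T

# Rotate first, then RoPE: R does not commute with RoPE, scores change.
scores_wrong = rope(Q @ R.T, pos) @ rope(K @ R.T, pos).T

# Mean per-vector quantization error with and without the rotation.
K_rotated = K_rope @ R.T
err_plain = np.linalg.norm(quantize_3bit(K_rope) - K_rope, axis=-1).mean()
err_rot = np.linalg.norm(quantize_3bit(K_rotated) - K_rotated, axis=-1).mean()

print("scores preserved (RoPE first):", np.allclose(scores, scores_rot))
print("scores preserved (rotation first):", np.allclose(scores, scores_wrong))
print("mean 3-bit error, plain vs rotated:", err_plain, err_rot)
```

In this sketch the query must be rotated by the same matrix at attention time, which is cheap because it is a single matrix multiply per head.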
Findings
Achieves a 7.1x memory reduction on ESM-2 650M with cosine similarity > 0.96.
Provides a 1.96x speedup in KV fetch operations with a fused decode attention kernel.
Addresses amino acid vocabulary sparsity, improving quantization robustness.
Abstract
The rapid scaling of Protein Language Models (PLMs) has unlocked unprecedented accuracy in protein structure prediction and design, but the quadratic memory growth of the Key-Value (KV) cache during inference remains a prohibitive barrier for single-GPU deployment and high-throughput generation. While 8-bit quantization is now standard, 3-bit quantization remains elusive due to severe numerical outliers in activations. This paper presents TurboESM, an adaptation of Google's TurboQuant to the PLM domain. We solve the fundamental incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations by deriving a RoPE-first rotation pipeline. We introduce a head-wise SVD calibration method tailored to the amino acid activation manifold, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual…
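The abstract is cut off at the QJL residual step, so the paper's exact formulation is not shown here. As a rough illustration of the general 1-bit QJL idea only, the sketch below stores the sign of a Gaussian random projection of a residual vector plus its norm, and uses them to estimate the inner product between a query and that residual. The dimensions, the synthetic residual `r`, and the number of projections `m` are assumptions chosen to make the estimate tight, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096  # m sign bits; large here only so the estimate is tight

S = rng.standard_normal((m, d))    # shared Gaussian projection matrix
q = rng.standard_normal(d)         # query vector (kept in full precision)
r = 0.1 * rng.standard_normal(d)   # stand-in for a quantization residual

# What would be stored for the residual: one sign bit per projection and one scalar.
signs = np.sign(S @ r)
r_norm = np.linalg.norm(r)

# For Gaussian s, E[(s.q) * sign(s.r)] = sqrt(2/pi) * <q, r> / ||r||,
# so rescaling the sample mean gives an unbiased estimate of <q, r>.
estimate = np.sqrt(np.pi / 2) * r_norm / m * ((S @ q) @ signs)
exact = q @ r

# Five standard deviations of the estimator, using an upper bound on its variance.
bound = 5 * np.sqrt(np.pi / 2) * np.linalg.norm(q) * r_norm / np.sqrt(m)

print("exact <q, r>:   ", exact)
print("1-bit estimate: ", estimate)
```

The estimator is unbiased, but its variance shrinks only as 1/m, so the number of sign bits trades memory against accuracy.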
