TurboESM: Ultra-Efficient 3-Bit KV Cache Quantization for Protein Language Models with Orthogonal Rotation and QJL Correction

Yue Hu; Junqing Wang; Yingchao Liu

arXiv:2603.26110 · q-bio.QM · March 30, 2026

TL;DR

TurboESM is a 3-bit quantization method for the key-value (KV) cache of protein language models. It reports a 7.1x reduction in cache memory on ESM-2 650M with cosine similarity above 0.96, aimed at making single-GPU inference practical.

Contribution

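TurboESM adapts Google's TurboQuant to PLMs. It resolves the incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations with a RoPE-first rotation pipeline, and adds head-wise SVD calibration, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual correction to make 3-bit quantization workable.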

Findings

01. Achieves a 7.1x memory reduction on ESM-2 650M with cosine similarity > 0.96.

02. Provides a 1.96x speedup in KV fetch operations with a fused decode attention kernel.

03. Addresses amino acid vocabulary sparsity, improving quantization robustness.

Abstract

The rapid scaling of Protein Language Models (PLMs) has unlocked unprecedented accuracy in protein structure prediction and design, but the quadratic memory growth of the Key-Value (KV) cache during inference remains a prohibitive barrier for single-GPU deployment and high-throughput generation. While 8-bit quantization is now standard, 3-bit quantization remains elusive due to severe numerical outliers in activations. This paper presents TurboESM, an adaptation of Google's TurboQuant to the PLM domain. We solve the fundamental incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations by deriving a RoPE-first rotation pipeline. We introduce a head-wise SVD calibration method tailored to the amino acid activation manifold, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual…
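The RoPE-first ordering mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: a random orthogonal matrix stands in for the paper's SVD-calibrated rotation, a plain per-row uniform quantizer stands in for the LUT-based one, and every function name here (`rope`, `random_orthogonal`, `quantize_uniform`, `dequantize`) is hypothetical. The point it shows is only the standard linear-algebra fact behind the ordering: rotating queries and keys by the same orthogonal matrix after RoPE leaves attention scores unchanged, whereas rotating before RoPE does not.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim), dim even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def random_orthogonal(dim, rng):
    """Stand-in rotation (assumption): Q factor of a Gaussian matrix."""
    q_mat, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q_mat

def quantize_uniform(x, bits=3):
    """Per-row uniform quantization to 2**bits levels (illustrative only)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    scale = (x.max(axis=1, keepdims=True) - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)         # guard constant rows
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

rng = np.random.default_rng(0)
seq, dim = 16, 64
q = rng.standard_normal((seq, dim))
k = rng.standard_normal((seq, dim))
pos = np.arange(seq, dtype=np.float64)
R = random_orthogonal(dim, rng)

# RoPE first, then rotate: scores match the unrotated reference.
q_rope, k_rope = rope(q, pos), rope(k, pos)
scores_ref = q_rope @ k_rope.T
scores_rot = (q_rope @ R) @ (k_rope @ R).T

# Rotate first, then RoPE: a generic R does not commute with RoPE.
scores_wrong = rope(q @ R, pos) @ rope(k @ R, pos).T

# Cache only 3-bit codes of the rotated keys; dequantize at attention time.
codes, lo, scale = quantize_uniform(k_rope @ R, bits=3)
k_hat = dequantize(codes, lo, scale)
scores_quant = (q_rope @ R) @ k_hat.T

print("rotation preserves scores:", np.allclose(scores_ref, scores_rot))
print("pre-RoPE rotation preserves scores:", np.allclose(scores_ref, scores_wrong))
print("max abs error after 3-bit quantization:", np.abs(scores_ref - scores_quant).max())
```

In this sketch the query is rotated by the same matrix at attention time, so the cache never needs to store unrotated keys. The quantization error printed at the end is specific to this toy setup and says nothing about the accuracy figures reported for TurboESM.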
