TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

Toshiaki Koike-Akino; Jing Liu; Ye Wang

arXiv:2603.19296 · cs.LG · March 25, 2026

Open Access

TL;DR

This paper introduces TTQ, a test-time quantization framework that compresses large models on the fly during inference. It requires no retraining, calibrates on each prompt to avoid domain-shift issues on unseen tasks, and still achieves an inference speedup.

Contribution

The paper presents a novel activation-aware, online calibration method for test-time quantization that adapts to unseen downstream tasks without retraining.

Findings

01

TTQ outperforms state-of-the-art quantization baselines.

02

TTQ achieves significant inference speedup.

03

TTQ maintains high accuracy across diverse tasks.

Abstract

To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain shift issues may arise for unseen downstream tasks. To resolve this issue, we propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task, while still achieving inference speedup. Several experiments demonstrate that TTQ can improve quantization performance over state-of-the-art baselines.
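The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of calibrating an activation-aware quantizer on the current prompt rather than on an offline calibration set. It swaps in a simple AWQ-style channel scaling with round-to-nearest quantization. The function names, the `alpha` exponent, and the 4-bit symmetric scheme are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization with one scale per output row.
    Returns the dequantized weights so the error can be inspected directly."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def activation_aware_quantize(w, x, bits=4, alpha=0.5):
    """Hypothetical per-prompt calibration (an assumption, not the paper's method).
    w: (out_features, in_features) weights; x: (tokens, in_features) activations
    observed for the current prompt."""
    act = np.abs(x).mean(axis=0)            # mean magnitude per input channel
    s = np.maximum(act, 1e-8) ** alpha      # larger scale for more active channels
    s = s / np.sqrt(s.max() * s.min())      # keep the scales centred around 1
    w_q = quantize_rtn(w * s, bits)         # quantize the channel-scaled weights
    return w_q / s                          # fold the scaling back into the weights

# Usage: calibrate on the prompt's own activations, then compare output errors.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(4, 16)) * np.linspace(0.1, 5.0, 16)  # uneven channel magnitudes
y_ref = x @ w.T
err_plain = np.abs(x @ quantize_rtn(w).T - y_ref).mean()
err_aware = np.abs(x @ activation_aware_quantize(w, x).T - y_ref).mean()
print(f"plain RTN error: {err_plain:.4f}, activation-aware error: {err_aware:.4f}")
```

Because the scales are recomputed from `x` on every call, the calibration statistics always come from the prompt being served, which is the property the abstract attributes to online calibration. Setting `alpha=0` disables the scaling and reduces the sketch to plain round-to-nearest quantization.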

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Data Compression Techniques