Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Haiquan Lu; Zigeng Chen; Gongfan Fang; Xinyin Ma; Xinchao Wang

arXiv:2605.20315 · cs.CL · May 21, 2026

TL;DR

Mix-Quant is a phase-aware quantization framework that quantizes only the prefilling stage of agentic LLM inference, speeding up prefilling by up to 3x with minimal loss in task performance.

Contribution

The paper shows that the prefilling phase can be quantized on its own with minimal accuracy loss, which makes agentic LLM workflows more efficient while leaving decoding unquantized.

Findings

01  Up to 3x speedup in prefilling inference time.

02  Minimal performance degradation across long-context and agentic benchmarks.

03  Effective decoupling of prefilling acceleration from decoding quality.
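
The reported speedup applies to prefilling only, so the end-to-end gain depends on how much of a request's runtime is spent in prefill. The sketch below applies Amdahl's law to make that relationship concrete. The function name and the prefill share used in the note that follows are hypothetical illustrations, not measurements from the paper.

```python
def end_to_end_speedup(prefill_fraction, prefill_speedup):
    """Amdahl's law: only the prefill share of the runtime is accelerated.

    prefill_fraction: share of baseline runtime spent in prefill, in [0, 1]
                      (a hypothetical input, not a figure from the paper).
    prefill_speedup:  factor by which prefill alone gets faster.
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    if prefill_speedup <= 0:
        raise ValueError("prefill_speedup must be positive")
    decode_share = 1.0 - prefill_fraction            # unchanged part
    prefill_share = prefill_fraction / prefill_speedup  # accelerated part
    return 1.0 / (decode_share + prefill_share)
```

For example, if prefill hypothetically accounted for 75% of the runtime, a 3x prefill speedup would give a 2x end-to-end speedup. The overall gain approaches 3x only as the prefill share approaches 100%.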

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the…
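
The phase-aware idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract is truncated before it says which tensors are quantized, so the sketch assumes weight-only quantization. It simulates FP4 (E2M1) rounding with one scale per block of 16 values, as in NVFP4, but keeps each scale as an ordinary float rather than an FP8 value. All names (`fake_quantize_fp4`, `PhaseAwareLinear`, the `phase` argument) are hypothetical.

```python
# Illustrative sketch only: simulated ("fake") FP4 quantization in pure Python.
# Names are hypothetical and are not taken from the Mix-Quant repository.

FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
BLOCK_SIZE = 16  # NVFP4 block size; the scale is a plain float here (simplification)


def fake_quantize_fp4(values, block_size=BLOCK_SIZE):
    """Scale each block so its max magnitude maps to 6.0, then round to the grid."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max((abs(v) for v in block), default=0.0)
        scale = amax / FP4_E2M1_GRID[-1] if amax > 0 else 1.0
        for v in block:
            # Pick the nearest representable magnitude, then restore sign and scale.
            nearest = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def matvec(weight_rows, x):
    """Plain matrix-vector product over lists of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


class PhaseAwareLinear:
    """Linear layer using quantized weights in prefill and original weights in decode."""

    def __init__(self, weight_rows):
        self.weight_full = weight_rows
        # Assumption: weights are quantized once, row by row, ahead of time.
        self.weight_fp4 = [fake_quantize_fp4(row) for row in weight_rows]

    def forward(self, x, phase):
        if phase == "prefill":
            return matvec(self.weight_fp4, x)
        if phase == "decode":
            return matvec(self.weight_full, x)
        raise ValueError(f"unknown phase: {phase}")
```

Calling `forward(x, "prefill")` on the prompt tokens and `forward(x, "decode")` on generated tokens mirrors the split described above: decoding outputs are bit-identical to the unquantized layer, and only prefill carries rounding error. A real deployment would rely on hardware FP4 kernels for the speedup, which this simulation does not model.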

Code & Models

Repositories

haiquanlu/Mix-Quant (GitHub)
