TL;DR
Mix-Quant is a phase-aware quantization framework that accelerates the prefilling stage in agentic LLMs, significantly reducing inference time while maintaining task performance.
Contribution
Mix-Quant quantizes only the prefilling phase, which incurs minimal accuracy loss, and thereby improves efficiency in agentic LLM workflows.
Findings
Up to 3x speedup in prefilling inference time.
Minimal performance degradation across long-context and agentic benchmarks.
Effective decoupling of prefilling acceleration from decoding quality.
Abstract
LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the…
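The phase-aware idea in the abstract (low-precision arithmetic during prefilling, unquantized arithmetic during decoding) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `fake_quant_fp4`, `PhaseAwareLinear`, the block size of 16, the float (rather than FP8) block scale, and the choice to quantize activations as well as weights during prefill are all hypothetical. It rounds values to the magnitudes a 4-bit E2M1 float can represent and then dequantizes, so it models the rounding error only and gives no speedup.

```python
# Illustrative simulation only; every name here is hypothetical.

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of values sharing one scale


def fake_quant_fp4(values, block=BLOCK):
    """Round each block to the E2M1 grid using one shared scale, then dequantize.

    The scale is kept as a plain Python float, which is a simplification.
    """
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Map the largest magnitude in the block onto the largest grid value.
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def matvec(weight_rows, x):
    """Plain matrix-vector product over lists of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


class PhaseAwareLinear:
    """Linear layer that picks its precision from the inference phase."""

    def __init__(self, weight_rows):
        self.w_full = weight_rows
        # Quantize the weights once, ahead of time, for the prefill path.
        self.w_fp4 = [fake_quant_fp4(row) for row in weight_rows]

    def forward(self, x, phase):
        if phase == "prefill":
            # Assumption: weights and activations are both rounded to FP4.
            return matvec(self.w_fp4, fake_quant_fp4(x))
        if phase == "decode":
            # Decoding stays on the original weights and activations.
            return matvec(self.w_full, x)
        raise ValueError(f"unknown phase: {phase!r}")
```

Calling `forward(x, "prefill")` and `forward(x, "decode")` on the same input shows how much error the 4-bit rounding introduces in the prefill path while the decode path is left untouched.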
