KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
Yichao Yuan, Mosharaf Chowdhury, Nishil Talati

TL;DR
KAIROS is a context-aware system that improves power efficiency in agentic AI inference serving. It uses each agent's long-lived context and progress as signals to manage GPU frequency, per-instance concurrency, and request placement across instances.
Contribution
KAIROS introduces a power-management approach that accounts for agent context and how each request evolves across turns, and it outperforms traditional techniques that optimize for single-turn serving.
Findings
Achieves 27% average power reduction while maintaining performance.
Effectively manages GPU frequency and request placement based on agent context.
Reduces power consumption across diverse agentic AI tasks.
Abstract
Power has become a central bottleneck for AI inference. This problem is becoming more urgent as agentic AI emerges as a major workload class, yet prior power-management techniques focus almost entirely on single-turn LLM serving. Our analysis shows that agentic serving behaves fundamentally differently: each request carries long-lived context that evolves across tool-interleaved turns, and lowering GPU frequency can push the system into a thrashing regime where memory pressure sharply worsens both performance and power efficiency. These observations show that power optimization for agentic serving requires rethinking. We present KAIROS, a context-aware power optimization system for agentic AI serving. KAIROS uses agent context as a first-class control signal to jointly manage GPU frequency, per-instance concurrency, and multi-instance request placement. This enables KAIROS to save…
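To make the idea of context as a control signal concrete, here is a minimal Python sketch of context-aware placement with a per-instance admission check. It is an illustration only, not the paper's algorithm: the `Instance` class, the `place` function, the token-based capacity model, and the `headroom` parameter are all hypothetical names and assumptions. GPU frequency control is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical serving instance with a fixed KV-cache budget in tokens."""
    name: str
    kv_capacity_tokens: int
    resident: dict = field(default_factory=dict)  # agent_id -> context tokens

    def used(self) -> int:
        return sum(self.resident.values())

    def free(self) -> int:
        return self.kv_capacity_tokens - self.used()


def place(instances, agent_id, context_tokens, headroom=0.1):
    """Place an agent on the instance with the most free KV capacity.

    Returns the chosen instance name, or None if the agent's context does
    not fit while keeping `headroom` (a fraction of capacity) in reserve.
    Returning None stands in for queueing the request instead of admitting
    it into an instance where memory pressure could cause thrashing.
    """
    best = max(instances, key=lambda inst: inst.free())
    reserve = int(best.kv_capacity_tokens * headroom)
    if best.free() - reserve < context_tokens:
        return None
    best.resident[agent_id] = context_tokens
    return best.name
```

In this sketch the admission check caps per-instance concurrency implicitly: an instance stops accepting agents once their combined context approaches its budget. A fuller controller would also update `resident` as contexts grow across turns and would pair the check with a frequency policy.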
