CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control

Qiaoling Chen; Zhisheng Ye; Tian Tang; Peng Sun; Boyu Tian; Guoteng Wang; Shenggui Li; Yonggang Wen; Zhenhua Han; Tianwei Zhang

arXiv:2601.22705 · cs.DC · February 2, 2026

Open Access

TL;DR

CONCUR applies congestion-style admission control to agentic batch inference for large language models, preventing KV cache thrashing and raising throughput by up to 4.09x (Qwen3-32B).

Contribution

The paper proposes proactive, agent-level admission control inspired by congestion control, which improves GPU KV cache efficiency during batch inference.

Findings

01. Up to 4.09x throughput improvement on Qwen3-32B
02. Up to 1.9x throughput improvement on DeepSeek-V3
03. Prevents middle-phase cache thrashing

Abstract

Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe throughput degradation well before memory capacity is exhausted. We identify this phenomenon as middle-phase thrashing, a previously under-characterized pathology in which cache efficiency collapses as long-lived agents accumulate state over time. We argue that mitigating this pathology requires moving beyond reactive, request-level cache management to proactive, agent-level admission control. Drawing inspiration from congestion control in distributed systems, we view the KV cache as a shared resource whose efficient utilization depends on feedback-driven regulation. Based on this insight, we present CONCUR, a lightweight control layer that regulates agent admission to bound aggregate cache pressure while preserving execution continuity. CONCUR adapts…
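
The abstract is cut off before it describes how CONCUR adapts admission, so the following is only an illustrative sketch of feedback-driven agent admission, not the paper's algorithm. It assumes an additive-increase/multiplicative-decrease (AIMD) rule borrowed from TCP congestion control, with KV cache hit rate as the feedback signal. The class name, thresholds, and step sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AimdAdmissionController:
    """Hypothetical AIMD cap on concurrently admitted agents (not from the paper)."""

    window: float = 4.0          # current cap on active agents (assumed start value)
    min_window: float = 1.0      # never drop below one running agent
    max_window: float = 256.0    # assumed upper bound
    increase: float = 1.0        # additive step while the cache looks healthy
    decrease: float = 0.5        # multiplicative factor when congestion is detected
    hit_rate_floor: float = 0.8  # assumed congestion threshold on KV cache hit rate

    def update(self, cache_hit_rate: float) -> int:
        """Adjust the cap from one feedback sample and return it as an integer."""
        if cache_hit_rate < self.hit_rate_floor:
            # Congestion signal: shrink the cap multiplicatively.
            self.window = max(self.min_window, self.window * self.decrease)
        else:
            # Healthy cache: probe for more concurrency additively.
            self.window = min(self.max_window, self.window + self.increase)
        return int(self.window)

    def can_admit(self, active_agents: int) -> bool:
        """Gate new agents only; agents already running are left untouched."""
        return active_agents < int(self.window)


ctrl = AimdAdmissionController()
caps = [ctrl.update(h) for h in (0.95, 0.93, 0.60, 0.90)]
print(caps)
```

Gating only new admissions, rather than evicting running agents, is one way to read the abstract's phrase "preserving execution continuity"; whether CONCUR does exactly this is not stated in the visible text.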

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems