TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale
Dongha Yoon, Younghoon Min, Hoshik Kim, Sam H. Noh, and Jongryool Kim

TL;DR
TraCT uses CXL shared memory for disaggregated LLM serving, cutting average TTFT by up to 9.8x and raising peak throughput by up to 1.6x by removing the NIC hop from KV transfer.
Contribution
This paper introduces TraCT, a novel rack-scale LLM serving system using CXL shared memory for direct KV access, addressing synchronization and consistency challenges.
Findings
Up to 9.8x reduction in average TTFT
Up to 6.2x lower P99 latency
Up to 1.6x peak throughput improvement
Abstract
Disaggregated LLM serving improves resource efficiency by separating the compute-intensive prefill phase from the latency-critical decode phase. However, this architecture introduces a fundamental bottleneck: key/value (KV) tensors generated during prefill must be transferred to decode workers, and existing systems rely on RDMA-based network paths for this exchange. As model sizes and context lengths increase, KV transfer dominates both time-to-first-token (TTFT) and peak throughput, and remains highly sensitive to network contention even when prefix reuse is high. This paper presents TraCT, a rack-scale LLM serving system that uses CXL shared memory as both a KV-transfer substrate and a rack-wide prefix-aware KV cache. TraCT enables GPUs to write and read KV blocks directly through CXL load/store and DMA operations, eliminating the NIC hop that constrains existing disaggregated…
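To give a rough sense of why KV transfer can dominate TTFT as context grows, the sketch below estimates the KV footprint of a single prompt and the time to move it over a link. The per-token size formula (one key and one value tensor per layer) is standard for transformer attention. The model shape, prompt length, and link bandwidth are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache bytes produced per token.

    The factor 2 accounts for one key and one value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def transfer_seconds(num_tokens, bytes_per_token, link_bytes_per_s):
    """Idealized transfer time: payload size divided by link bandwidth.

    Ignores protocol overhead, contention, and any prefix reuse.
    """
    return num_tokens * bytes_per_token / link_bytes_per_s


# Hypothetical model shape (assumption): 32 layers, 8 KV heads,
# head dimension 128, 2-byte (fp16/bf16) elements.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)

# Hypothetical prompt length (assumption).
prompt_tokens = 8192
total_bytes = per_token * prompt_tokens

# Hypothetical link bandwidth (assumption): 25 GB/s, roughly a 200 Gb/s NIC.
link_bytes_per_s = 25e9
seconds = transfer_seconds(prompt_tokens, per_token, link_bytes_per_s)

print(f"KV per token: {per_token / 1024:.0f} KiB")
print(f"KV per prompt: {total_bytes / 2**30:.2f} GiB")
print(f"Idealized transfer time: {seconds * 1e3:.1f} ms")
```

Under these assumed numbers a single 8192-token prompt yields about 1 GiB of KV data, so every prefill-to-decode handoff moves a payload of that order, and concurrent requests share the same link. This is the cost that a shared-memory path with prefix reuse aims to avoid.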
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization
