Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning
Yong-Cheng Liaw, Shuo-Han Chen

TL;DR
This paper proposes a CXL-aware memory management approach for long-context LLM fine-tuning. By controlling how tensors are placed across local DRAM and CXL devices, it improves throughput by up to 21% over naive memory placement.
Contribution
It introduces a PyTorch extension and memory allocator that enable fine-grained, per-tensor control and optimized placement across CXL and DRAM, addressing the coarse, process-level NUMA policies that current frameworks expose.
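To make the idea of per-tensor placement concrete, the following is a minimal sketch of a greedy planner that decides, tensor by tensor, whether a buffer should live in local DRAM or spill to CXL memory. It is not the paper's allocator or its PyTorch extension: the names `TensorInfo` and `plan_placement`, the `latency_sensitive` flag, the example tensor sizes, and the greedy policy itself are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

GIB = 1024 ** 3


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical description of one offloaded tensor (not from the paper)."""
    name: str
    size_bytes: int
    latency_sensitive: bool  # assumed hint: True if frequently accessed


def plan_placement(tensors: Iterable[TensorInfo],
                   dram_budget_bytes: int) -> Dict[str, str]:
    """Assign each tensor to "dram" or "cxl" with a simple greedy policy.

    Latency-sensitive tensors are considered first, smaller ones before
    larger ones. A tensor that does not fit in the remaining DRAM budget
    is placed on CXL instead.
    """
    placement: Dict[str, str] = {}
    remaining = dram_budget_bytes
    ordered = sorted(tensors,
                     key=lambda t: (not t.latency_sensitive, t.size_bytes))
    for t in ordered:
        if t.size_bytes <= remaining:
            placement[t.name] = "dram"
            remaining -= t.size_bytes
        else:
            placement[t.name] = "cxl"
    return placement


# Illustrative sizes only; these are not measurements from the paper.
example_tensors = [
    TensorInfo("optimizer_state", 6 * GIB, latency_sensitive=True),
    TensorInfo("activation_checkpoints", 10 * GIB, latency_sensitive=False),
    TensorInfo("master_params", 4 * GIB, latency_sensitive=True),
]
plan = plan_placement(example_tensors, dram_budget_bytes=12 * GIB)
```

With a 12 GiB DRAM budget, this sketch keeps the two latency-sensitive tensors in DRAM and sends the activation checkpoints to CXL. A process-level policy cannot express this kind of distinction, because it applies one rule to every allocation the process makes.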
Findings
Achieves 97-99% of DRAM-only throughput with a single CXL add-in card (AIC).
Provides up to 21% speedup over naive memory placement.
Enables scaling of long-context fine-tuning beyond DRAM capacity.
Abstract
The substantial memory requirements of Large Language Models (LLMs), particularly for long-context fine-tuning, have renewed interest in CPU offloading to augment limited GPU memory. However, as context lengths grow, relying on CPU memory for intermediate states introduces a significant bottleneck that can exhaust the capacity of mainstream client platforms. To address this limitation, this work investigates the effectiveness of Compute Express Link (CXL) add-in card (AIC) memory as an extension to CPU memory, enabling larger model sizes and longer context lengths during fine-tuning. Extensive benchmarking reveals two critical challenges. First, current deep learning frameworks such as PyTorch lack fine-grained, per-tensor control over NUMA memory allocation, exposing only coarse, process-level policies. Second, due to this lack of control, when the memory footprint of fine-tuning is…
Taxonomy
Topics: Advancements in Photolithography Techniques · Medical Imaging Techniques and Applications
