An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Ruijia Yang; Zeyi Wen

arXiv:2603.16428·cs.DC·March 18, 2026

TL;DR

SlideFormer is a system for fine-tuning large language models on a single GPU. It reduces memory usage and raises throughput by combining heterogeneous memory management, an asynchronous execution engine, and optimized Triton kernels.

Contribution

The paper introduces SlideFormer, a heterogeneous co-design system that allows large language model fine-tuning on a single GPU with significant improvements in memory efficiency and performance.

Findings

01. Supports fine-tuning of 123B+ models on a single RTX 4090.

02. Achieves 1.40x to 6.27x higher throughput than baselines.

03. Roughly halves CPU/GPU memory usage while maintaining >95% peak performance.
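The arithmetic below gives a rough sense of why a 123B-parameter model cannot simply sit on one GPU during fine-tuning. It is a back-of-the-envelope sketch, not a calculation from the paper. The per-parameter byte counts (2 bytes each for bf16 weights and gradients, 12 bytes of fp32 Adam state) and the layer count of 88 are assumptions chosen for illustration. The 24 GiB figure is the RTX 4090's memory capacity.

```python
# Back-of-the-envelope memory estimate.
# All byte counts and the layer count are illustrative assumptions,
# not figures reported by the paper.
GIB = 1024 ** 3
GPU_MEMORY_GIB = 24  # RTX 4090 capacity


def finetune_bytes(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Bytes for weights, gradients, and Adam state under mixed precision."""
    return num_params * (weight_bytes + grad_bytes + optim_bytes)


def per_layer_weight_bytes(num_params, num_layers, weight_bytes=2):
    """Approximate weight bytes of one layer if parameters are spread evenly."""
    return num_params * weight_bytes // num_layers


params = 123_000_000_000   # 123B parameters
hypothetical_layers = 88   # assumed layer count, for illustration only

total_gib = finetune_bytes(params) / GIB
layer_gib = per_layer_weight_bytes(params, hypothetical_layers) / GIB

print(f"full training state: {total_gib:,.0f} GiB")
print(f"one layer's weights: {layer_gib:.2f} GiB")
```

Under these assumptions the full training state is far larger than GPU memory, while a single layer's weights fit comfortably. That gap is what makes a layer-at-a-time approach with host-side storage plausible.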

Abstract

Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) a lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O; (2) a highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage; and (3) optimized Triton kernels that address key bottlenecks, together with integrated advanced I/O. This co-design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving…
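The sliding-window idea in the abstract can be illustrated with a toy pipeline: while one layer is being computed, the next layer's weights are fetched in the background, so only a small window of layers is resident at a time. This is a minimal sketch under stated assumptions, not SlideFormer's engine. The names `load_layer` and `sliding_forward` are hypothetical, a background thread stands in for an asynchronous host-to-device copy, and each "layer" is just a `[scale, bias]` pair. It covers only forward-pass prefetching, not the CPU optimizer updates or multi-tier I/O the abstract mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(host_weights, i):
    """Stand-in for an asynchronous host-to-device copy of layer i (hypothetical)."""
    return list(host_weights[i])


def sliding_forward(host_weights, x):
    """Toy forward pass that keeps at most two layers 'on device' at once.

    Layer i is computed while layer i + 1 is fetched by a background thread.
    """
    num_layers = len(host_weights)
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer, host_weights, 0)
        for i in range(num_layers):
            scale, bias = pending.result()      # wait until layer i has arrived
            if i + 1 < num_layers:
                # Start fetching the next layer before computing this one.
                pending = io.submit(load_layer, host_weights, i + 1)
            x = scale * x + bias                # toy "compute" for layer i
            # Layer i's device copy goes out of scope here; the window slides.
    return x


# Three toy layers, each a [scale, bias] pair kept in host memory.
weights = [[2.0, 1.0], [3.0, 0.0], [1.0, -1.0]]
result = sliding_forward(weights, 1.0)
print(result)
```

A single worker thread is used so that fetches complete in layer order. A real system would more likely use separate device streams and pinned host buffers to get the same overlap.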

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Network Packet Processing and Optimization