OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models
Seungwoo Roh, Huiyeong Kim, Jong-Chan Kim

TL;DR
This paper introduces a system-level framework that enables memory-efficient inference for vision-language-action models on GPUs with limited VRAM, without modifying the models themselves.
Contribution
It proposes a three-stage memory management framework and a performance prediction model to optimize GPU memory usage and inference speed for large models.
Findings
Achieves up to 3.55x speedup over existing offloading methods.
Reduces VRAM usage from model-level to layer-level granularity.
Maintains full BF16 precision during inference.
Abstract
End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer-compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model…
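The three stages in the abstract can be illustrated with a toy latency model. The sketch below is not the paper's performance prediction model, which is not shown in this summary. It assumes one serial transfer stream, one serial compute stream, and two staging buffers, and it treats a GPU-resident layer as having zero transfer time. The function names, the `resident` parameter, and the example timings are hypothetical.

```python
def sequential_latency(transfer, compute):
    """Sequential demand layering: load a layer, run it, then load the next."""
    return sum(t + c for t, c in zip(transfer, compute))


def pipelined_latency(transfer, compute, resident=frozenset()):
    """Pipelined demand layering under a simplified double-buffer assumption.

    transfer[i] / compute[i] are the transfer and execution times of layer i.
    Layers whose index is in `resident` are assumed to stay on the GPU,
    so their transfer time is treated as zero.
    """
    xfer_done = []  # time at which layer i's parameters are on the GPU
    comp_done = []  # time at which layer i finishes executing
    for i, (t, c) in enumerate(zip(transfer, compute)):
        if i in resident:
            t = 0.0
        # The transfer stream is serial: wait for the previous transfer.
        prev_xfer = xfer_done[i - 1] if i >= 1 else 0.0
        # With two buffers, layer i reuses the buffer of layer i - 2,
        # which is free only after that layer has finished computing.
        buf_free = comp_done[i - 2] if i >= 2 else 0.0
        xfer_done.append(max(prev_xfer, buf_free) + t)
        # Compute needs both the previous layer's output and its own weights.
        prev_comp = comp_done[i - 1] if i >= 1 else 0.0
        comp_done.append(max(prev_comp, xfer_done[i]) + c)
    return comp_done[-1] if comp_done else 0.0


# Hypothetical timings in milliseconds for a transfer-bound model.
transfer_ms = [4.0, 4.0, 4.0]
compute_ms = [2.0, 2.0, 2.0]

print(sequential_latency(transfer_ms, compute_ms))                   # 18.0
print(pipelined_latency(transfer_ms, compute_ms))                    # 14.0
print(pipelined_latency(transfer_ms, compute_ms, resident={0, 1, 2}))  # 6.0
```

In this toy setting, pipelining alone cannot hide transfers that take longer than the compute they overlap with, which is the residual overhead that the residency policy targets. When compute dominates instead (for example, transfer 1 ms and compute 3 ms per layer), only the first transfer remains exposed.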
