TL;DR
OxyGen introduces unified KV cache management for multi-task Vision-Language-Action (VLA) inference on edge devices. It shares the KV cache across tasks and batches across frames, achieving up to 3.7× speedup over isolated execution.
Contribution
It presents an inference design that treats the KV cache as a shared resource across tasks and over time, enabling cross-task KV sharing and cross-frame continuous batching for efficient multi-task VLA inference.
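The idea behind cross-task KV sharing can be illustrated with a toy sketch. The code below is not OxyGen's implementation or API: `ToyBackbone`, `ToyKVCache`, `run_isolated`, and `run_shared` are hypothetical names, and the "cache" is a plain list standing in for real key/value tensors. It only shows why prefilling a shared observation once, rather than once per task, removes redundant work.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for per-layer key/value tensors."""
    entries: List[Tuple[int, int]] = field(default_factory=list)


class ToyBackbone:
    """Hypothetical model wrapper that counts how often prefill runs."""

    def __init__(self) -> None:
        self.prefill_calls = 0

    def prefill(self, observation_tokens: List[int]) -> ToyKVCache:
        # A real model would run attention over the observation here;
        # this toy version just records one (key, value) pair per token.
        self.prefill_calls += 1
        return ToyKVCache([(t, t * 2) for t in observation_tokens])


def run_isolated(backbone: ToyBackbone, observation: List[int],
                 tasks: List[str]) -> Dict[str, int]:
    """Each task builds its own cache from the same observation."""
    outputs = {}
    for task in tasks:
        cache = backbone.prefill(observation)  # repeated per task
        outputs[task] = len(cache.entries)
    return outputs


def run_shared(backbone: ToyBackbone, observation: List[int],
               tasks: List[str]) -> Dict[str, int]:
    """The observation is prefilled once; every task reads the same cache."""
    cache = backbone.prefill(observation)  # done once for all tasks
    return {task: len(cache.entries) for task in tasks}


if __name__ == "__main__":
    obs = [1, 2, 3, 4]
    tasks = ["manipulation", "conversation", "memory"]

    isolated = ToyBackbone()
    run_isolated(isolated, obs, tasks)
    print(isolated.prefill_calls)  # 3

    shared = ToyBackbone()
    run_shared(shared, obs, tasks)
    print(shared.prefill_calls)  # 1
```

In this toy setting the saving is simply one prefill instead of one per task; the speedup figures reported under Findings come from the paper's own system and measurements, not from this sketch.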
Findings
Achieves up to 3.7× speedup over isolated execution.
Delivers over 200 tokens/s language throughput.
Maintains 70 Hz action frequency without degrading quality.
Abstract
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
