SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Chaojun Ni; Cheng Chen; Xiaofeng Wang; Zheng Zhu; Wenzhao Zheng; Boyuan Wang; Tianrun Chen; Guosheng Zhao; Haoyun Li; Zhehao Dong; Qiang Zhang; Yun Ye; Yang Wang; Guan Huang; Wenjun Mei

arXiv:2512.00903 · cs.CV · December 2, 2025

TL;DR

SwiftVLA introduces a lightweight architecture that incorporates 4D spatiotemporal understanding into vision-language-action models, enabling efficient action reasoning with minimal overhead and high performance on edge devices.

Contribution

The paper proposes SwiftVLA, a novel method that integrates 4D features into compact VLA models using a pretrained 4D transformer, Fusion Tokens, and a mask-and-reconstruct training strategy.
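
As a rough illustration of how a mask-and-reconstruct objective might be wired around Fusion Tokens, here is a minimal, framework-free sketch. It is not the authors' implementation. It assumes that "masking" means replacing every 4D feature token with a shared placeholder vector, that the Fusion Tokens are appended after the 2D and 4D tokens, and that reconstruction is scored with mean squared error; none of these details are stated on this page, and the names `build_sequence`, `reconstruction_loss`, and `mask_token` are hypothetical.

```python
import random


def build_sequence(tokens_2d, tokens_4d, fusion_tokens, mask_4d, mask_token):
    """Assemble the token sequence fed to the compact VLM.

    Assumption: when mask_4d is True, each 4D token is replaced by a copy
    of mask_token, so the model must infer geometry from the 2D tokens.
    """
    if mask_4d:
        shown_4d = [list(mask_token) for _ in tokens_4d]
    else:
        shown_4d = [list(tok) for tok in tokens_4d]
    return [list(t) for t in tokens_2d] + shown_4d + [list(t) for t in fusion_tokens]


def reconstruction_loss(predicted_4d, target_4d):
    """Mean squared error between reconstructed and original 4D features.

    Assumption: MSE is used here purely for illustration.
    """
    total, count = 0.0, 0
    for pred, target in zip(predicted_4d, target_4d):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def sample_training_input(tokens_2d, tokens_4d, fusion_tokens, mask_token,
                          mask_prob, rng):
    """Randomly decide whether to hide the 4D tokens for this training step."""
    mask_4d = rng.random() < mask_prob
    sequence = build_sequence(tokens_2d, tokens_4d, fusion_tokens,
                              mask_4d, mask_token)
    return sequence, mask_4d


if __name__ == "__main__":
    rng = random.Random(0)
    tokens_2d = [[1.0, 2.0]]
    tokens_4d = [[3.0, 4.0], [5.0, 6.0]]
    fusion_tokens = [[0.0, 0.0]]   # learnable in a real model; fixed here
    mask_token = [9.0, 9.0]        # hypothetical placeholder vector
    sequence, masked = sample_training_input(
        tokens_2d, tokens_4d, fusion_tokens, mask_token, 0.5, rng)
    print(len(sequence), masked)
```

In this sketch, the original 4D features remain the reconstruction target on masked steps, so the loss only makes sense when a model output is compared against `tokens_4d` rather than against the placeholder.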

Findings

01. Outperforms lightweight baselines in real and simulated environments.
02. Rivals larger VLAs with up to 7 times more parameters.
03. Achieves 18x faster inference and 12x lower memory usage.

Abstract

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning