CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Hao Li; Shuai Yang; Yilun Chen; Xinyi Chen; Xiaoda Yang; Yang Tian; Hanqing Wang; Tai Wang; Dahua Lin; Feng Zhao; Jiangmiao Pang

arXiv:2506.19816 · cs.RO · October 31, 2025

TL;DR

CronusVLA introduces a multi-frame vision-language-action framework that enhances robotic manipulation by leveraging temporal information efficiently, resulting in improved robustness and performance in simulated and real-world tasks.

Contribution

CronusVLA extends single-frame VLA models to a multi-frame paradigm through a two-stage training process (single-frame pretraining followed by multi-frame post-training), improving robustness and efficiency in robotic manipulation.

Findings

01 Achieves a 70.9% success rate on the SimplerEnv benchmark.

02 Outperforms OpenVLA by 26.8% on LIBERO.

03 Demonstrates superior robustness under observational disturbances.

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and…

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation