CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang

TL;DR
CronusVLA introduces a multi-frame vision-language-action framework that enhances robotic manipulation by leveraging temporal information efficiently, resulting in improved robustness and performance in simulated and real-world tasks.
Contribution
CronusVLA extends single-frame VLA models to a multi-frame paradigm through a two-stage training process, improving robustness and efficiency in robotic manipulation.
Findings
Achieves a 70.9% success rate on the SimplerEnv benchmark.
Outperforms OpenVLA by 26.8% on LIBERO.
Demonstrates superior robustness under observational disturbances.
Abstract
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and…
Code & Models
- 🤗 JeasLee/openvla-0.5b-prismatic · model · 5 downloads
- 🤗 JeasLee/cronusvla_7B_bridge_rt_1 · model · 4 downloads
- 🤗 JeasLee/cronusvla_0.5B_bridge_rt_1 · model
- 🤗 JeasLee/cronusvla_7B_libero_goal · model · 3 downloads
- 🤗 JeasLee/cronusvla_7B_libero_goal_w_wrist · model · 5 downloads
- 🤗 JeasLee/cronusvla_7B_libero_10 · model · 2 downloads
- 🤗 JeasLee/cronusvla_7B_libero_object · model · 3 downloads
- 🤗 JeasLee/cronusvla_7B_libero_object_w_wrist · model · 10 downloads
- 🤗 JeasLee/cronusvla_7B_libero_spatial · model · 2 downloads
- 🤗 JeasLee/cronusvla_7B_libero_spatial_w_wrist · model · 6 downloads
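A minimal sketch of how one of the checkpoints listed above might be fetched, assuming the repositories are publicly readable on the Hugging Face Hub and that the `huggingface_hub` package is installed. The repository names are read from the list, taking the "model" label as a type badge rather than part of the name. The helpers `repo_id` and `download` are hypothetical names introduced here for illustration. Loading the weights into a policy requires the authors' own code, which this listing does not show.

```python
# Checkpoint names as read from the list above (owner prefix kept separately).
OWNER = "JeasLee"
CHECKPOINTS = [
    "openvla-0.5b-prismatic",
    "cronusvla_7B_bridge_rt_1",
    "cronusvla_0.5B_bridge_rt_1",
    "cronusvla_7B_libero_goal",
    "cronusvla_7B_libero_goal_w_wrist",
    "cronusvla_7B_libero_10",
    "cronusvla_7B_libero_object",
    "cronusvla_7B_libero_object_w_wrist",
    "cronusvla_7B_libero_spatial",
    "cronusvla_7B_libero_spatial_w_wrist",
]


def repo_id(name: str) -> str:
    """Build the full 'owner/name' repository id (hypothetical helper)."""
    if name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {name!r}")
    return f"{OWNER}/{name}"


def download(name: str, cache_dir=None) -> str:
    """Download every file in the repository and return the local folder path.

    Assumes the repository is public; nothing is downloaded until this
    function is called.
    """
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(name), cache_dir=cache_dir)


if __name__ == "__main__":
    # Only prints the ids; call download(...) explicitly to fetch weights.
    for name in CHECKPOINTS:
        print(repo_id(name))
```

Calling `download("cronusvla_7B_libero_10")` would fetch that repository into the local Hub cache. The 7B checkpoints are likely to be large, so the download is kept behind an explicit call.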
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation
