Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev

TL;DR
Green-VLA is a staged vision-language-action framework enabling generalist robots to operate across diverse embodiments with improved safety, robustness, and efficiency through a curriculum of training stages and real-world evaluations.
Contribution
The paper introduces a novel staged VLA framework with a unified embodiment-aware interface and extensive real-world data, advancing robot generalization and safety.
Findings
A single policy generalizes across humanoids, mobile manipulators, and fixed-base arms.
RL policy alignment (stage R2) improves performance over the earlier stages.
Real-robot deployment demonstrates robustness.
Abstract
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface enabling a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE…
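To make the abstract's two structural ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It lists the five curriculum stages as named above and shows one plausible way a unified, embodiment-aware action interface could work: each robot's native action vector is packed into a shared fixed-width vector with a validity mask. The slot names, slot widths, embodiment keys, and the functions `to_unified` and `from_unified` are all hypothetical assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The five curriculum stages named in the abstract, in order."""
    L0 = "foundational VLMs"
    L1 = "multimodal grounding"
    R0 = "multi-embodiment pretraining"
    R1 = "embodiment-specific adaptation"
    R2 = "RL policy alignment"


# ASSUMPTION: slot name -> (start, end) range in a shared action vector.
# These names and widths are illustrative only, not taken from the paper.
UNIFIED_SLOTS = {
    "left_arm": (0, 7),
    "right_arm": (7, 14),
    "base": (14, 17),
}
UNIFIED_DIM = 17

# ASSUMPTION: which slots each embodiment type occupies.
EMBODIMENT_SLOTS = {
    "fixed_base_arm": ["right_arm"],
    "mobile_manipulator": ["right_arm", "base"],
    "humanoid": ["left_arm", "right_arm"],
}


@dataclass
class UnifiedAction:
    values: list  # length UNIFIED_DIM, zeros in unused dimensions
    mask: list    # True where the embodiment actually uses the dimension


def to_unified(embodiment, native_action):
    """Pack a native action vector into the shared layout with a mask."""
    values = [0.0] * UNIFIED_DIM
    mask = [False] * UNIFIED_DIM
    offset = 0
    for slot in EMBODIMENT_SLOTS[embodiment]:
        start, end = UNIFIED_SLOTS[slot]
        width = end - start
        chunk = list(native_action[offset:offset + width])
        if len(chunk) != width:
            raise ValueError(f"native action too short for slot {slot!r}")
        values[start:end] = chunk
        mask[start:end] = [True] * width
        offset += width
    if offset != len(native_action):
        raise ValueError("native action has unexpected extra dimensions")
    return UnifiedAction(values, mask)


def from_unified(embodiment, action):
    """Recover the native action vector by reading only the used slots."""
    native = []
    for slot in EMBODIMENT_SLOTS[embodiment]:
        start, end = UNIFIED_SLOTS[slot]
        native.extend(action.values[start:end])
    return native
```

Under these assumptions, a policy always emits a `UNIFIED_DIM`-wide vector, and the mask tells a training loss or a robot driver which dimensions are meaningful for the current embodiment. The actual Green-VLA interface may differ substantially.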
Code & Models
- 🤗 SberRoboticsCenter/GreenVLA-5b-base-stride-4 · model · 17 dl
- 🤗 SberRoboticsCenter/GreenVLA-5b-stride-1-R1-bridge · model · 15 dl
- 🤗 SberRoboticsCenter/GreenVLA-5b-stride-4-R1-fractal · model · 1.1k dl
- 🤗 SberRoboticsCenter/GreenVLA-5b-stride-1-R2-bridge · model · 14 dl
- 🤗 SberRoboticsCenter/GreenVLA-2b-base · model · 16 dl
- 🤗 SberRoboticsCenter/GreenVLA-5b-base-stride-1 · model · 22 dl
- 🤗 SberRoboticsCenter/GreenVLA-5b-stride-4-R2-calvin · model · 16 dl · ♡ 1
- 🤗 TirGun/Sber_Qwen3-VL-4B-Instruct-action-GGUF · model · 302 dl
- 🤗 TirGun/Sber_Qwen3-VL-2B-Instruct-action-GGUF · model · 303 dl
Taxonomy
Topics: Reinforcement Learning in Robotics · Social Robot Interaction and HRI · Multimodal Machine Learning Applications
