Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

I. Apanasevich; M. Artemyev; R. Babakyan; P. Fedotova; D. Grankin; E. Kupryashin; A. Misailidi; D. Nerus; A. Nutalapati; G. Sidorov; I. Efremov; M. Gerasyov; D. Pikurov; Y. Senchenko; S. Davidenko; D. Kulikov; M. Sultankin; K. Askarbek; O. Shamanin; D. Statovoy; E. Zalyaev; I. Zorin; A. Letkin; E. Rusakov; A. Silchenko; V. Vorobyov; S. Sobolnikov; A. Postnikov

arXiv:2602.00919 · cs.RO · March 10, 2026

TL;DR

Green-VLA is a staged vision-language-action framework enabling generalist robots to operate across diverse embodiments with improved safety, robustness, and efficiency through a curriculum of training stages and real-world evaluations.

Contribution

The paper introduces a novel staged VLA framework with a unified embodiment-aware interface and extensive real-world data, advancing robot generalization and safety.

Findings

1. Strong generalization across multiple robot types.
2. Significant performance improvements with RL alignment.
3. Effective real-robot deployment demonstrating robustness.

Abstract

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE…
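
The abstract does not say how the unified, embodiment-aware action interface is laid out. As a rough illustration only, one common way to let a single policy drive robots with different degrees of freedom is to map each robot's native action vector into fixed slots of a shared vector and carry a mask marking which entries are active. The sketch below assumes that approach; the slot names, sizes, `Embodiment`, `to_unified`, and `from_unified` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical slot layout (start, stop) in the shared action vector.
# These names and sizes are illustrative assumptions, not Green-VLA's layout.
UNIFIED_SLOTS = {
    "base": (0, 3),         # planar base velocity: x, y, yaw
    "left_arm": (3, 10),    # up to 7 joints
    "right_arm": (10, 17),  # up to 7 joints
    "grippers": (17, 19),   # up to 2 grippers
}
UNIFIED_DIM = 19


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot: part name -> active dimensions."""
    name: str
    parts: dict


def to_unified(emb, native):
    """Place a robot's native action into the shared vector; return (action, mask)."""
    if len(native) != sum(emb.parts.values()):
        raise ValueError("native action length does not match embodiment")
    action = [0.0] * UNIFIED_DIM
    mask = [False] * UNIFIED_DIM
    offset = 0
    for part, n in emb.parts.items():
        start, stop = UNIFIED_SLOTS[part]
        if n > stop - start:
            raise ValueError(f"{part} needs {n} dims, slot has {stop - start}")
        action[start:start + n] = native[offset:offset + n]
        mask[start:start + n] = [True] * n
        offset += n
    return action, mask


def from_unified(emb, action):
    """Read back only the entries this embodiment actually uses."""
    native = []
    for part, n in emb.parts.items():
        start, _ = UNIFIED_SLOTS[part]
        native.extend(action[start:start + n])
    return native


if __name__ == "__main__":
    arm = Embodiment("fixed_arm", {"right_arm": 6, "grippers": 1})
    action, mask = to_unified(arm, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.0])
    print(sum(mask), "active dims out of", UNIFIED_DIM)
    print(from_unified(arm, action))
```

Under this assumed design, the mask would let a training loss ignore slots a given robot does not have, so a fixed-base arm and a humanoid could share one output head without the arm being penalized for unused dimensions.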

Taxonomy

Topics: Reinforcement Learning in Robotics · Social Robot Interaction and HRI · Multimodal Machine Learning Applications