StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community

arXiv:2604.05014·cs.RO·April 8, 2026


TL;DR

StarVLA is a comprehensive open-source framework that unifies vision-language-action model development, enabling modular, reproducible research across multiple benchmarks and backbones.

Contribution

StarVLA introduces a modular architecture with interchangeable backbones and action heads, reusable training strategies, and integrated benchmarks, making VLA models easier to develop and compare.

Findings

01. Achieves competitive performance on multiple benchmarks.

02. Supports both vision-language models and world models.

03. Provides a unified, reproducible training and evaluation pipeline.

Abstract

Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently.…
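The shared abstraction described in the abstract, where backbone and action head are swapped independently, can be illustrated with a small sketch. This is not StarVLA's actual API: every name below (`Backbone`, `ActionHead`, `VLAPolicy`, and the toy classes) is hypothetical, and the "models" are trivial arithmetic stand-ins rather than Qwen-VL or Cosmos. It only shows how two narrow interfaces let any backbone pair with any head.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class Backbone(Protocol):
    """Hypothetical interface: turn an observation and instruction into features."""

    def encode(self, image: Sequence[float], instruction: str) -> List[float]: ...


class ActionHead(Protocol):
    """Hypothetical interface: turn backbone features into an action vector."""

    def decode(self, features: Sequence[float]) -> List[float]: ...


@dataclass
class ToyVLMBackbone:
    """Stand-in for a VLM backbone: mixes image statistics with the instruction."""

    feature_dim: int = 4

    def encode(self, image: Sequence[float], instruction: str) -> List[float]:
        mean_pixel = sum(image) / len(image)
        return [mean_pixel + 0.01 * len(instruction)] * self.feature_dim


@dataclass
class ToyWorldModelBackbone:
    """Stand-in for a world-model backbone: uses only image dynamics (here, range)."""

    feature_dim: int = 4

    def encode(self, image: Sequence[float], instruction: str) -> List[float]:
        return [max(image) - min(image)] * self.feature_dim


@dataclass
class ToyRegressionHead:
    """Toy continuous decoding: emits the mean feature on every action dimension."""

    action_dim: int = 7

    def decode(self, features: Sequence[float]) -> List[float]:
        avg = sum(features) / len(features)
        return [avg] * self.action_dim


@dataclass
class ToyBinnedHead:
    """Toy discrete decoding: snaps the mean feature to one of num_bins in [-1, 1]."""

    action_dim: int = 7
    num_bins: int = 11

    def decode(self, features: Sequence[float]) -> List[float]:
        avg = sum(features) / len(features)
        clipped = max(-1.0, min(1.0, avg))
        bin_index = round((clipped + 1.0) / 2.0 * (self.num_bins - 1))
        value = -1.0 + 2.0 * bin_index / (self.num_bins - 1)
        return [value] * self.action_dim


@dataclass
class VLAPolicy:
    """Composes any Backbone with any ActionHead; neither knows about the other."""

    backbone: Backbone
    action_head: ActionHead

    def act(self, image: Sequence[float], instruction: str) -> List[float]:
        return self.action_head.decode(self.backbone.encode(image, instruction))


# Swapping either component leaves the other, and the calling code, unchanged.
image = [0.0, 0.5, 1.0]
for backbone in (ToyVLMBackbone(), ToyWorldModelBackbone()):
    for head in (ToyRegressionHead(), ToyBinnedHead()):
        policy = VLAPolicy(backbone, head)
        action = policy.act(image, "pick up the spoon")
        print(type(backbone).__name__, type(head).__name__, len(action))
```

In a real codebase the two interfaces would carry tensors and training losses rather than lists of floats, but the design point is the same: the policy depends only on the two interfaces, so a new backbone or decoding paradigm can be added without touching the rest.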

Code & Models

Repositories

starVLA/starVLA (GitHub)

Models

StarVLA/Qwen3VL-PI_v3-Bridge-RT_1 (Hugging Face model, 62 downloads, 2 likes)