StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems

Jinhui Ye; Ning Gao; Senqiao Yang; Jinliang Zheng; Zixuan Wang; Yuxin Chen; Pengguang Chen; Yilun Chen; Shu Liu; Jiaya Jia

arXiv:2604.11757·cs.RO·April 14, 2026

TL;DR

StarVLA-$\alpha$ is a deliberately simple yet strong baseline for vision-language-action (VLA) models. It stays competitive across multiple benchmarks while keeping design complexity to a minimum.

Contribution

The paper presents a minimal yet effective VLA baseline whose simplified architecture and training pipeline enable systematic analysis of design choices while retaining strong performance.

Findings

01

Under unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive.

02

A single generalist model outperforms previous methods on RoboChallenge.

03

Minimal design can achieve strong results without complex engineering.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly…

Code & Models

Repositories

starVLA/starVLA
github
