TL;DR
StarVLA-α introduces a simplified, strong baseline for vision-language-action models that achieves competitive performance across multiple benchmarks, emphasizing minimal design complexity.
Contribution
It presents a minimalistic yet effective VLA baseline that simplifies architecture and training, enabling systematic analysis and strong performance.
Findings
Under unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same baseline remains highly competitive.
A single generalist model outperforms previous methods on RoboChallenge.
Minimal design can achieve strong results without complex engineering.
Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-α, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-α deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly…
