TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

Junjie Wen; Yichen Zhu; Jinming Li; Minjie Zhu; Kun Wu; Zhiyuan Xu; Ning Liu; Ran Cheng; Chaomin Shen; Yaxin Peng; Feifei Feng; Jian Tang

arXiv:2409.12514 · cs.RO · May 14, 2025 · 3 cites

TL;DR

TinyVLA is a compact vision-language-action model for robotic manipulation that runs faster and needs less data than existing VLA models, generalizes to new objects and environments, and does not require a pre-training stage on large robotic datasets.

Contribution

The paper presents TinyVLA, a VLA model that achieves faster inference and removes the pre-training stage by initializing the policy backbone with a fast multimodal model and adding a diffusion policy decoder during fine-tuning.

Findings

1. Outperforms OpenVLA in speed and data efficiency
2. Maintains or exceeds performance across diverse tasks and conditions
3. Demonstrates strong generalization to new objects and environments

Abstract

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive…
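
The two components named in the abstract (a multimodal backbone that produces a conditioning embedding, and a diffusion decoder that turns noise into an action) can be pictured with a small sketch. Everything here is an assumption made for illustration, not the paper's implementation: the sampler is a standard DDPM-style reverse process with a linear noise schedule, and `toy_backbone`, `toy_target_action`, and `toy_noise_predictor` are hypothetical stand-ins for trained networks. The "noise predictor" is an oracle that already knows the clean action, which keeps the example runnable with the standard library only.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption); returns (betas, alphas, alpha_bars)."""
    span = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * t / span for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a  # cumulative product of alphas
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_backbone(image_feature, instruction_feature):
    """Hypothetical stand-in for the multimodal backbone: concatenates features."""
    return list(image_feature) + list(instruction_feature)


def toy_target_action(embedding, action_dim):
    """Hypothetical 'clean' action for an embedding; a real policy learns this implicitly."""
    return [math.tanh(sum(embedding) * (i + 1) / action_dim) for i in range(action_dim)]


def toy_noise_predictor(noisy_action, t, embedding, alpha_bars):
    """Oracle stand-in for a trained denoiser: recovers the noise exactly
    from x_t = sqrt(alpha_bar_t) * a + sqrt(1 - alpha_bar_t) * eps."""
    target = toy_target_action(embedding, len(noisy_action))
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - scale * a) / sigma for x, a in zip(noisy_action, target)]


def sample_action(embedding, action_dim, num_steps=50, seed=0):
    """DDPM-style reverse process: start from Gaussian noise, denoise step by step."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t, embedding, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            std = math.sqrt(betas[t])  # sigma_t^2 = beta_t variance choice
            x = [m + std * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added on the final step
    return x


if __name__ == "__main__":
    embedding = toy_backbone([0.1, 0.2], [0.3])
    print(sample_action(embedding, action_dim=3))
```

In a real system the oracle would be replaced by a learned network conditioned on the backbone's embedding. The loop also shows why the number of denoising steps matters for latency: each step is one forward pass of the denoiser.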

Code & Models

Repositories

liyaxuanliyaxuan/TinyVLA (PyTorch)

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings