SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor; Dana Aubakirova; Francesco Capuano; Pepijn Kooijmans; Steven Palma; Adil Zouitine; Michel Aractingi; Caroline Pascal; Martino Russi; Andres Marafioti; Simon Alibert; Matthieu Cord; Thomas Wolf; Remi Cadene

arXiv:2506.01844·cs.LG·June 3, 2025

Open Access · 1 Repo · 10 Models · 5 Datasets

TL;DR

SmolVLA is a compact, efficient vision-language-action model for robotics that reduces training and deployment costs while maintaining high performance, enabling broader accessibility and real-time responsiveness in robotic applications.

Contribution

This work introduces SmolVLA, a small and efficient VLA model trained on community data, capable of running on consumer hardware with competitive performance.

Findings

01 Achieves performance comparable to larger VLAs despite being 10x smaller.

02 Can be trained on a single GPU and deployed on consumer-grade hardware.

03 Improves responsiveness with asynchronous inference and chunked action generation.
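
The third finding combines two ideas: the policy emits a chunk of several actions per forward pass, and the next chunk is computed in the background while the robot is still executing the current one. The listing does not give implementation details, so the following is only a minimal sketch of that pattern, not the authors' code. The names `dummy_policy`, `AsyncChunkRunner`, `chunk_size`, and `refill_threshold` are hypothetical, and the policy is a stand-in function rather than a real VLA.

```python
import threading
from collections import deque


def dummy_policy(observation, chunk_size=5):
    """Hypothetical stand-in for a VLA forward pass: returns one chunk of actions."""
    return [observation + i for i in range(chunk_size)]


class AsyncChunkRunner:
    """Executes actions from a queue and refills it in a background thread.

    Assumption for illustration: a refill is requested once the number of
    queued actions drops to `refill_threshold` or below.
    """

    def __init__(self, policy, chunk_size=5, refill_threshold=2):
        self.policy = policy
        self.chunk_size = chunk_size
        self.refill_threshold = refill_threshold
        self.queue = deque()
        self.lock = threading.Lock()
        self.worker = None

    def _refill(self, observation):
        # Runs off the control loop: compute a whole chunk, then append it atomically.
        chunk = self.policy(observation, self.chunk_size)
        with self.lock:
            self.queue.extend(chunk)

    def step(self, observation):
        """Return the next action, starting a background refill when running low."""
        with self.lock:
            remaining = len(self.queue)
        busy = self.worker is not None and self.worker.is_alive()
        if remaining <= self.refill_threshold and not busy:
            self.worker = threading.Thread(
                target=self._refill, args=(observation,), daemon=True
            )
            self.worker.start()
        if remaining == 0:
            # Nothing left to execute: block until the pending chunk arrives.
            self.worker.join()
        with self.lock:
            return self.queue.popleft()


runner = AsyncChunkRunner(dummy_policy, chunk_size=5, refill_threshold=2)
actions = [runner.step(0) for _ in range(12)]
```

In this sketch the control loop only blocks when the queue is completely empty (for example on the very first step). Otherwise inference overlaps with execution, which is the sense in which chunking plus asynchronous inference can improve responsiveness.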

Abstract

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed…

Code & Models

Repositories

huggingface/lerobot (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robot Manipulation and Learning