SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene

TL;DR
SmolVLA is a compact, efficient vision-language-action model for robotics that reduces training and deployment costs while maintaining high performance, enabling broader accessibility and real-time responsiveness in robotic applications.
Contribution
This work introduces SmolVLA, a small and efficient VLA model trained on community data, capable of running on consumer hardware with competitive performance.
Findings
- Achieves performance comparable to larger VLAs despite being 10x smaller.
- Can be trained on a single GPU and deployed on consumer-grade hardware.
- Improves responsiveness with asynchronous inference and chunked action generation.
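To make the last finding concrete, here is a minimal sketch of how chunked action generation and asynchronous inference can fit together. It is an illustration only, not the authors' implementation: `dummy_policy`, `run_episode`, `CHUNK_SIZE`, and `REFILL_THRESHOLD` are hypothetical names and values, and the "policy" is a stand-in that returns labelled strings instead of motor commands. The idea shown is that the policy predicts several actions at once, the robot consumes them one per control step, and the next chunk is requested in the background before the current one runs out.

```python
import queue
import threading
from collections import deque

CHUNK_SIZE = 4        # hypothetical: actions returned per policy call
REFILL_THRESHOLD = 2  # hypothetical: request a new chunk when this few remain


def dummy_policy(observation: int) -> list[str]:
    """Stand-in for a VLA forward pass: returns one chunk of labelled actions."""
    return [f"obs{observation}-a{i}" for i in range(CHUNK_SIZE)]


def run_episode(num_steps: int) -> list[str]:
    """Execute num_steps actions while new chunks are predicted in a background thread."""
    actions = deque(dummy_policy(0))  # the first chunk is computed synchronously
    results: queue.Queue = queue.Queue(maxsize=1)
    worker = None
    executed = []

    for step in range(1, num_steps + 1):
        # Start predicting the next chunk before the current one is exhausted.
        if worker is None and len(actions) <= REFILL_THRESHOLD:
            worker = threading.Thread(
                target=lambda s=step: results.put(dummy_policy(s))
            )
            worker.start()

        # If nothing is left to execute, wait for the pending chunk.
        if not actions:
            worker.join()

        # Collect a finished chunk without blocking the control loop.
        if worker is not None and not worker.is_alive():
            actions.extend(results.get())
            worker = None

        executed.append(actions.popleft())  # "execute" one action per step

    return executed


if __name__ == "__main__":
    print(run_episode(10))
```

In this sketch the control loop blocks only when the action queue is completely empty; otherwise inference overlaps with execution. A real deployment would replace the thread with whatever transport connects the robot to the machine running the model, and would pass the latest camera frames and robot state as the observation.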
Abstract
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed…
Code & Models
- 🤗 lerobot/smolvla_base (model) · 33k dl · ♡ 357
- 🤗 HuggingFaceVLA/smolvla_libero (model) · 8.7k dl · ♡ 9
- 🤗 mjung11/smolvla_nc_total_50000_n10 (model) · ♡ 1
- 🤗 pepijn223/mobile_so100_test (model) · 8 dl · ♡ 1
- 🤗 rancheng222/smolvla_so101_tie_bag (model) · 8 dl · ♡ 1
- 🤗 jccj/smolvla_pickup_home_45k (model) · 2 dl · ♡ 1
- 🤗 jccj/smolvla_pickup_cube_full_res (model) · 4 dl · ♡ 1
- 🤗 jccj/smolvla_pickup_cube_full_res_last (model) · 4 dl
- 🤗 jccj/smolvla_pickup_cube_resized_last (model) · 1 dl · ♡ 1
- 🤗 pepijn223/my_smolvla (model) · 1 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robot Manipulation and Learning
