SmolVLM: Redefining small and efficient multimodal models
Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf

TL;DR
SmolVLM introduces a series of resource-efficient multimodal models that outperform larger models in image and video tasks while using significantly less GPU memory, enabling practical deployment on edge devices.
Contribution
The paper presents novel architectural and tokenization strategies for small VLMs, achieving high performance with minimal memory usage, and demonstrates their effectiveness on image and video tasks.
Findings
SmolVLM-256M uses less than 1GB GPU memory during inference.
SmolVLM outperforms much larger models such as Idefics-80B.
Models show strong video comprehension capabilities.
Abstract
Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger…
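The two size claims above (under 1GB of GPU memory, and a model roughly 300 times larger) can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope estimate, not a measurement from the paper: it assumes the parameter counts implied by the model names (256M and 80B), and the helper `weight_memory_gb` and the listed precisions are hypothetical choices made for illustration. It counts weights only, so activations and any cache are not included.

```python
# Back-of-envelope check of the size claims. All figures are assumptions
# inferred from the model names, not measurements reported by the paper.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


SMOLVLM_256M_PARAMS = 256e6  # assumed from the name "SmolVLM-256M"
IDEFICS_80B_PARAMS = 80e9    # assumed from the name "Idefics-80B"

# Weight footprint of the smallest model at each assumed precision.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(SMOLVLM_256M_PARAMS, dtype):.3f} GB")

# Parameter-count ratio between the two models.
size_ratio = IDEFICS_80B_PARAMS / SMOLVLM_256M_PARAMS
print(f"size ratio: {size_ratio:.1f}x")
```

Under these assumptions, 16-bit weights take about half of the 1GB budget, leaving the rest for activations, while 32-bit weights alone would already exceed it. The parameter ratio comes out slightly above 300, which is consistent with the "300-times larger" wording in the abstract.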
Code & Models
- 🤗 HuggingFaceTB/SmolVLM2-2.2B-Instruct · model · 107k downloads · ♡ 311
- 🤗 HuggingFaceTB/SmolVLM2-500M-Video-Instruct · model · 258k downloads · ♡ 126
- 🤗 HuggingFaceTB/SmolVLM-Instruct · model · 30k downloads · ♡ 581
- 🤗 HuggingFaceTB/SmolVLM-256M-Instruct · model · 312k downloads · ♡ 345
- 🤗 HuggingFaceTB/SmolVLM-500M-Instruct · model · 39k downloads · ♡ 190
- 🤗 HuggingFaceTB/SmolVLM2-256M-Video-Instruct · model · 156k downloads · ♡ 98
- 🤗 HuggingFaceTB/SmolVLM2-2.2B-Base · model · 191 downloads · ♡ 11
- 🤗 yushan777/SmolVLM-Instruct · model · 1 download
- 🤗 yushan777/SmolVLM-500M-Instruct · model · 5 downloads
- 🤗 yushan777/SmolVLM-256M-Instruct · model · 1 download
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
