SmolVLM: Redefining small and efficient multimodal models

Andrés Marafioti; Orr Zohar; Miquel Farré; Merve Noyan; Elie Bakouch; Pedro Cuenca; Cyril Zakka; Loubna Ben Allal; Anton Lozhkov; Nouamane Tazi; Vaibhav Srivastav; Joshua Lochner; Hugo Larcher; Mathieu Morlon; Lewis Tunstall; Leandro von Werra; Thomas Wolf

arXiv:2504.05299·cs.AI·April 8, 2025·2 cites

Open Access · 10 Models

TL;DR

SmolVLM introduces a series of resource-efficient multimodal models that outperform larger models in image and video tasks while using significantly less GPU memory, enabling practical deployment on edge devices.

Contribution

The paper presents novel architectural and tokenization strategies for small VLMs, achieving high performance with minimal memory usage, and demonstrates their effectiveness on image and video tasks.

Findings

01. SmolVLM-256M uses less than 1 GB of GPU memory during inference.

02. SmolVLM-256M outperforms Idefics-80B, a model roughly 300 times its size.

03. The SmolVLM models show strong video comprehension capabilities.

Abstract

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger…
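
The two figures quoted in the abstract (a model "300 times larger" and "less than 1GB" of GPU memory) can be sanity-checked with simple arithmetic. The sketch below is only a rough illustration: it assumes the parameter counts are the nominal ones implied by the model names (256 million and 80 billion) and that weights are stored in a 16-bit format at 2 bytes per parameter. It estimates weight storage only; activations and any caches add to the real inference footprint, so this is not the paper's measurement method.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal parameter counts read off the model names.
SMOLVLM_256M_PARAMS = 256e6
IDEFICS_80B_PARAMS = 80e9

# How many times larger the big model is, by parameter count.
size_ratio = IDEFICS_80B_PARAMS / SMOLVLM_256M_PARAMS

# Assumption: 16-bit weights (2 bytes per parameter).
small_gb = weight_memory_gb(SMOLVLM_256M_PARAMS)
large_gb = weight_memory_gb(IDEFICS_80B_PARAMS)

print(f"size ratio: {size_ratio:.1f}x")                      # 312.5x
print(f"SmolVLM-256M weights (16-bit): {small_gb:.3f} GB")   # 0.512 GB
print(f"Idefics-80B weights (16-bit): {large_gb:.0f} GB")    # 160 GB
```

Under these assumptions the weights of the smallest model take about half a gigabyte, which leaves headroom under the 1 GB figure for activations, and the size ratio of about 312 is consistent with the "300-times larger" wording.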

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques