TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
Bibin Wilson

TL;DR
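TinyVLM is a framework for zero-shot object detection on microcontrollers with less than 1MB of memory. It combines a decoupled architecture (visual inference runs on-device, class text embeddings are precomputed), Matryoshka distillation of nested embeddings, and quantized embedding storage to reach real-time performance on edge devices.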
Contribution
The paper introduces TinyVLM, a lightweight zero-shot detection framework with a decoupled architecture, Matryoshka distillation, and quantized embeddings, suitable for resource-constrained microcontrollers.
Findings
Achieves competitive zero-shot accuracy on the COCO, Flowers102, and Food101 datasets.
Runs at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000.
Uses less than 1MB of memory, enabling practical edge deployment.
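For a rough sense of scale behind the memory figure, the sketch below computes how many bytes a table of class embeddings would occupy at the embedding widths quoted in the abstract (16 to 256), stored as float32 versus int8. The class count of 80 (the number of COCO categories) and the helper name `prototype_bytes` are illustrative assumptions; the paper's actual memory budget and storage layout are not given in this summary.

```python
def prototype_bytes(num_classes: int, dim: int, bytes_per_value: int) -> int:
    """Bytes needed to store one embedding per class (no scales or metadata)."""
    return num_classes * dim * bytes_per_value


NUM_CLASSES = 80  # assumption: COCO's 80 categories, used only for illustration

for dim in (16, 32, 64, 128, 256):
    fp32 = prototype_bytes(NUM_CLASSES, dim, 4)  # 4 bytes per float32 value
    int8 = prototype_bytes(NUM_CLASSES, dim, 1)  # 1 byte per int8 value
    print(f"dim={dim:3d}  float32={fp32:6d} B  int8={int8:6d} B  ratio={fp32 // int8}x")
```

Under these assumptions even the widest table stays in the tens of kilobytes, which suggests the class prototypes are a small share of a 1MB budget.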
Abstract
Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision-language models (VLMs) like CLIP that require hundreds of megabytes of memory, far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M),…
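The three ideas in the abstract can be illustrated with a minimal pure-Python sketch of the matching step only. It assumes that a nested (Matryoshka) embedding is used by keeping its first `dim` coordinates and renormalizing, that the 4x reduction corresponds to float32 to int8 storage with one scale per vector, and that matching is a dot product against stored class prototypes. All function names, the toy vectors, and the quantization scheme are hypothetical; the summary does not specify the paper's exact method, and object localization is not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def truncate(v, dim):
    """Matryoshka-style use: keep the first `dim` coordinates, then renormalize."""
    return l2_normalize(v[:dim])


def quantize_int8(v):
    """Symmetric per-vector int8 quantization (assumed scheme): (codes, scale)."""
    max_abs = max(abs(x) for x in v)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in v]
    return codes, scale


def classify(image_emb, class_table, dim):
    """Return the label whose dequantized prototype best matches the image."""
    query = truncate(image_emb, dim)
    best_label, best_score = None, -math.inf
    for label, (codes, scale) in class_table.items():
        score = scale * sum(c * q for c, q in zip(codes, query))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Offline step: text embeddings are computed once, truncated, and quantized,
# so no text encoder is needed on the device. Values here are toy data.
DIM = 4
text_embs = {
    "cat": [0.9, 0.1, 0.0, 0.2, 0.3, -0.1, 0.0, 0.1],
    "dog": [0.1, 0.9, 0.2, 0.0, -0.2, 0.3, 0.1, 0.0],
}
class_table = {label: quantize_int8(truncate(e, DIM)) for label, e in text_embs.items()}

# On-device step: only the image embedding is produced at run time.
image_emb = [0.8, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0]
print(classify(image_emb, class_table, DIM))
```

Changing `DIM` trades prototype storage against how much of the embedding is used for matching, which is the accuracy-memory trade-off the abstract describes.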
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
