OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

Wei Chen; Zhiyuan Li; Shuo Xin

arXiv:2412.11475 · cs.CV · December 30, 2024

TL;DR

OmniVLM is a compact, 968M-parameter vision-language model that compresses its visual token sequence from 729 to 81 tokens, matching the capabilities of larger models while running substantially faster on edge devices.

Contribution

The paper introduces a novel token compression mechanism and a multi-stage training pipeline for a sub-billion-parameter vision-language model optimized for on-device inference.

Findings

01

Reduces visual token sequence length from 729 to 81 tokens.

02

Achieves 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) than nanoLLAVA on the same laptop.

03

Outperforms existing baselines such as nanoLLAVA on benchmarks including ScienceQA, POPE, and MMMU within a 968M-parameter footprint.
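The 729-to-81 reduction in the first finding is a factor of 9. This summary does not describe how the compression is implemented, so the following is only a minimal sketch of one way such a reduction could work, assuming that every 9 consecutive token vectors are concatenated into a single wider vector. The function name `compress_tokens`, the `group` parameter, and the toy hidden size are hypothetical and chosen for illustration.

```python
def compress_tokens(tokens, group=9):
    """Concatenate each run of `group` consecutive token vectors into one wider vector.

    Assumption for illustration only: the source states the 729 -> 81 reduction
    but not the mechanism, so this grouping scheme is hypothetical.
    """
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    compressed = []
    for start in range(0, len(tokens), group):
        merged = []
        for vec in tokens[start:start + group]:
            merged.extend(vec)  # widen the vector instead of discarding values
        compressed.append(merged)
    return compressed


hidden = 4  # toy hidden size, kept small so the example is easy to inspect
tokens = [[float(i)] * hidden for i in range(729)]  # 729 dummy visual tokens
out = compress_tokens(tokens)

print(len(tokens), "->", len(out), "tokens")    # 729 -> 81 tokens
print(hidden, "->", len(out[0]), "values each")  # 4 -> 36 values each
```

Under this assumption no values are dropped: the sequence becomes 9 times shorter while each token becomes 9 times wider, so a later projection layer would be needed to map the widened vectors back to the language model's embedding size.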

Abstract

We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines like nanoLLAVA within a 968M-parameter footprint. Empirical results on the same laptop demonstrate 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) compared to nanoLLAVA, enabling efficient deployment on edge devices. The model weights can be accessed on…
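The speedup figures quoted in the abstract can be checked directly from the reported measurements. The short sketch below recomputes both ratios and then estimates total response time under a simple assumed model (time-to-first-token plus decoding time at a constant rate). The helper `response_time` and the 100-token response length are hypothetical and are not taken from the paper.

```python
# Measurements reported in the abstract (same laptop for both models).
ttft_omni, ttft_nano = 0.75, 6.82        # time-to-first-token, seconds
decode_omni, decode_nano = 29.41, 19.20  # decoding speed, tokens per second

ttft_speedup = ttft_nano / ttft_omni        # lower latency is better
decode_speedup = decode_omni / decode_nano  # higher throughput is better

print(f"TTFT speedup:   {ttft_speedup:.1f}x")    # 9.1x
print(f"Decode speedup: {decode_speedup:.1f}x")  # 1.5x


def response_time(ttft_s, tokens_per_s, n_tokens):
    """Estimate seconds to produce n_tokens, assuming a constant decode rate.

    This latency model is an illustrative assumption, not a reported result.
    """
    return ttft_s + n_tokens / tokens_per_s


n_tokens = 100  # hypothetical response length
omni_total = response_time(ttft_omni, decode_omni, n_tokens)
nano_total = response_time(ttft_nano, decode_nano, n_tokens)
print(f"Estimated time for {n_tokens} tokens: "
      f"{omni_total:.2f}s vs {nano_total:.2f}s")
```

Under this assumed model, the time-to-first-token gap dominates for short responses, while the decoding-speed gap matters more as responses get longer.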

Code & Models

Models

🤗
NexaAI/OmniVLM-968M
model · 1.1k downloads · 532 likes

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Brain Tumor Detection and Classification

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings