OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Xianhang Li; Yanqing Liu; Haoqin Tu; Hongru Zhu; Cihang Xie

arXiv:2505.04601·cs.CV·May 8, 2025

TL;DR

OpenVision is a fully open, cost-effective family of vision encoders that match or surpass OpenAI's CLIP when integrated into multimodal frameworks such as LLaVA, with released model sizes ranging from 5.9M to 632.1M parameters.

Contribution

This work presents a fully open-source vision encoder family, OpenVision, with detailed training recipes and data, enabling accessible and high-performance multimodal model development.

Findings

01. OpenVision encoders match or surpass CLIP performance.

02. Larger models improve multimodal task accuracy.

03. Smaller models enable efficient edge deployment.

Abstract

OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works -- e.g., CLIPS for training framework and Recap-DataComp-1B for training data -- while revealing multiple key insights in enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers…

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Sensor and Energy Harvesting Materials · Interactive and Immersive Displays

Methods: Contrastive Language-Image Pre-training