OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, Cihang Xie

TL;DR
OpenVision is a fully open, cost-effective family of vision encoders that match or outperform OpenAI's CLIP when integrated into multimodal frameworks, with model sizes ranging from 5.9M to 632.1M parameters.
Contribution
This work presents a fully open-source vision encoder family, OpenVision, with detailed training recipes and data, enabling accessible and high-performance multimodal model development.
Findings
OpenVision encoders match or surpass OpenAI's CLIP when integrated into multimodal frameworks like LLaVA.
Larger models improve multimodal task accuracy.
Smaller models enable efficient edge deployment.
Abstract
OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works -- e.g., CLIPS for training framework and Recap-DataComp-1B for training data -- while revealing multiple key insights in enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers…
Code & Models
- 🤗 UCSC-VLAA/openvision-vit-tiny-patch16-160 · model
- 🤗 UCSC-VLAA/openvision-vit-tiny-patch16-224 · model · 1 dl
- 🤗 UCSC-VLAA/openvision-vit-tiny-patch16-384 · model · 9 dl
- 🤗 UCSC-VLAA/openvision-vit-tiny-patch8-160 · model
- 🤗 UCSC-VLAA/openvision-vit-tiny-patch8-224 · model · 21 dl
- 🤗 UCSC-VLAA/openvision-vit-tiny-patch8-384 · model
- 🤗 UCSC-VLAA/openvision-vit-small-patch16-160 · model
- 🤗 UCSC-VLAA/openvision-vit-small-patch16-224 · model · 341 dl
- 🤗 UCSC-VLAA/openvision-vit-small-patch16-384 · model
- 🤗 UCSC-VLAA/openvision-vit-small-patch8-160 · model · ♡ 1
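The repository names listed above appear to encode the model size, the patch size, and the input resolution. The sketch below parses those three fields and estimates how many patch tokens each variant produces per image, which is one rough indicator of inference cost. The helper `describe` and its regular expression are hypothetical and inferred only from the names shown here. The token count assumes a standard ViT that splits a square image into non-overlapping patches and ignores any class or register tokens; the released models may differ.

```python
import re

# Hypothetical pattern, inferred from the repository names on this page:
#   openvision-vit-<size>-patch<patch>-<resolution>
NAME_RE = re.compile(
    r"openvision-vit-(?P<size>[a-z0-9]+)-patch(?P<patch>\d+)-(?P<res>\d+)$"
)


def describe(repo_id: str) -> dict:
    """Parse a repository name and estimate the patch-token count.

    Assumes a square input split into non-overlapping square patches,
    so tokens = (resolution / patch) ** 2. Extra tokens (e.g. a class
    token) are not counted.
    """
    match = NAME_RE.search(repo_id)
    if match is None:
        raise ValueError(f"unrecognized repository name: {repo_id!r}")
    patch = int(match["patch"])
    res = int(match["res"])
    if res % patch != 0:
        raise ValueError("resolution is not a multiple of the patch size")
    side = res // patch  # patches along one image side
    return {
        "size": match["size"],
        "patch": patch,
        "resolution": res,
        "patch_tokens": side * side,
    }


# Example: compare a few of the variants listed above.
for repo in (
    "UCSC-VLAA/openvision-vit-tiny-patch16-160",
    "UCSC-VLAA/openvision-vit-tiny-patch8-384",
    "UCSC-VLAA/openvision-vit-small-patch16-224",
):
    print(repo, describe(repo))
```

Under these assumptions, halving the patch size at a fixed resolution quadruples the token count, so a patch8 variant processes far more tokens per image than the patch16 variant at the same resolution.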
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Sensor and Energy Harvesting Materials · Interactive and Immersive Displays
Methods: Contrastive Language-Image Pre-training
