ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts

Sangbum Choi; Kyeongryeol Go; Taewoong Jang

arXiv:2507.04270·cs.CV·November 10, 2025

Open Access

TL;DR

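ZERO is an industry-ready vision foundation model that accepts multi-modal (textual and visual) prompts. Trained on 0.9 million annotated samples drawn from a proprietary billion-scale industrial dataset, it targets zero-shot industrial use: it is competitive on LVIS-Val and outperforms existing models across 37 industrial datasets.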

Contribution

Introduces ZERO, the first vision foundation model designed specifically for zero-shot industrial applications, combining multi-modal (textual and visual) prompts with training data drawn from a proprietary industrial dataset.

Findings

01 Competitive performance on the LVIS-Val benchmark

02 Outperforms existing models across 37 industrial datasets

03 Placed 2nd in the CVPR 2025 Object Instance Detection Challenge and 4th in the Foundational Few-shot Object Detection Challenge

Abstract

Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative 0.9 million annotated samples from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications