ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts
Sangbum Choi, Kyeongryeol Go, Taewoong Jang

TL;DR
ZERO is an industry-ready vision foundation model that combines multi-modal (textual and visual) prompts with training on a proprietary industrial dataset for zero-shot industrial applications. It is competitive on LVIS-Val and outperforms existing models across 37 industrial datasets.
Contribution
Introduces ZERO, the first vision foundation model designed specifically for zero-shot industrial applications using multi-modal prompts and a proprietary dataset.
Findings
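Competitive performance on the LVIS-Val academic benchmark
Significantly outperforms existing models across 37 diverse industrial datasets
Placed 2nd in the CVPR 2025 Object Instance Detection Challenge and 4th in the Foundational Few-shot Object Detection Challenge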
Abstract
Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative 0.9 million annotated samples from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
