dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
Saikat Dutta, Biplab Banerjee, Hamid Rezatofighi

TL;DR
dinov3.seg advances open-vocabulary semantic segmentation by combining a task-specific architecture, dual-level text embedding, early visual refinement, and high-resolution inference, improving accuracy and robustness in complex, cluttered scenes.
Contribution
The paper introduces dinov3.seg, a framework that extends dinov3.txt with a task-specific architecture, dual-level text embedding, and a high-resolution inference strategy to improve open-vocabulary semantic segmentation (OVSS) performance.
Findings
Consistently outperforms state-of-the-art methods on five benchmarks.
Enhances spatial precision and robustness in cluttered scenes.
Effectively combines semantic and spatial information for dense prediction.
Abstract
Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text…
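The image-text similarity maps mentioned in the abstract can be illustrated with a minimal sketch. This is a generic baseline, not the dinov3.seg architecture: it assumes a grid of per-patch visual features and one text embedding per class name, both already projected into a shared space, and labels each patch with the class of highest cosine similarity. The function name `similarity_segmentation` and the array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def similarity_segmentation(patch_feats, text_embeds):
    """Label each patch with its most similar text-defined class.

    patch_feats: (H, W, D) array of per-patch visual features (assumed input).
    text_embeds: (C, D) array, one embedding per class name (assumed input).
    Returns a (H, W) integer label map and the (H, W, C) similarity map.
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)

    # (H, W, D) @ (D, C) -> (H, W, C): one similarity map per class.
    sim = p @ t.T

    # Open-vocabulary prediction: pick the best-matching class per patch.
    labels = sim.argmax(axis=-1)
    return labels, sim
```

Because the class set enters only through `text_embeds`, new categories can be added at inference by embedding new class names. The maps are produced at patch resolution, which is coarser than the image, and this is one reason such maps alone tend to give limited spatial precision.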
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
