OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Haoxi Zeng; Qiankun Liu; Yi Bin; Haiyue Zhang; Yujuan Ding; Guoqing Wang; Deqiang Ouyang; Heng Tao Shen

arXiv:2604.08461 · cs.CV · April 10, 2026

TL;DR

This paper introduces OVS-DINO, a novel framework that enhances open-vocabulary segmentation by revitalizing DINO's boundary awareness through structural alignment with SAM, achieving state-of-the-art results.

Contribution

The paper proposes a structure-aware framework that aligns DINO with SAM to improve boundary perception in open-vocabulary segmentation tasks.

Findings

01. Achieves a 2.1% improvement in average benchmark scores.

02. Significantly improves segmentation in cluttered scenes by 6.3%.

03. Demonstrates state-of-the-art performance across multiple benchmarks.

Abstract

Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze the internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features pass into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes DINO's latent edge sensitivity through structural alignment with the Segment Anything Model (SAM).…
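The abstract's claim that boundary awareness attenuates in deeper blocks can be made concrete with a simple probe. The sketch below is an illustrative assumption, not the authors' analysis protocol: the function name `boundary_contrast`, the score definition (mean cosine dissimilarity between neighbouring patches across segment boundaries, minus the same quantity within segments), and the synthetic "shallow" and "deep" feature maps are all hypothetical. No real DINO or SAM features are used; attenuation is simulated by mixing features toward their global mean.

```python
import numpy as np


def boundary_contrast(features, labels):
    """Hypothetical probe of how strongly patch features separate at boundaries.

    features: (H, W, C) array of patch features.
    labels:   (H, W) integer array of segment ids.
    Returns mean cosine dissimilarity over neighbour pairs that cross a
    segment boundary minus the mean over pairs inside a segment.
    """
    # L2-normalise so that dot products are cosine similarities.
    norm = np.linalg.norm(features, axis=-1, keepdims=True)
    f = features / np.clip(norm, 1e-8, None)

    # Cosine dissimilarity between horizontal and vertical neighbours.
    dis_h = 1.0 - np.sum(f[:, :-1] * f[:, 1:], axis=-1)
    dis_v = 1.0 - np.sum(f[:-1, :] * f[1:, :], axis=-1)

    # A neighbour pair crosses a boundary when its segment ids differ.
    cross_h = labels[:, :-1] != labels[:, 1:]
    cross_v = labels[:-1, :] != labels[1:, :]

    dis = np.concatenate([dis_h.ravel(), dis_v.ravel()])
    cross = np.concatenate([cross_h.ravel(), cross_v.ravel()])
    return float(dis[cross].mean() - dis[~cross].mean())


# Synthetic 8x8 patch grid with two segments (left half, right half).
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1

# "Shallow" features: one orthogonal prototype per segment plus small noise.
rng = np.random.default_rng(0)
prototypes = np.eye(4)[:2]  # two orthogonal 4-d prototypes
shallow = prototypes[labels] + 0.05 * rng.normal(size=(8, 8, 4))

# "Deep" features: simulated attenuation (an assumption made for this demo),
# obtained by pulling every patch toward the global mean feature.
deep = 0.2 * shallow + 0.8 * shallow.mean(axis=(0, 1))

shallow_score = boundary_contrast(shallow, labels)
deep_score = boundary_contrast(deep, labels)
print(f"shallow: {shallow_score:.3f}  deep: {deep_score:.3f}")
```

On this toy input the shallow score is higher than the deep one, which is the pattern the abstract describes. Applying such a probe to real per-block features would require extracting them from an actual backbone and choosing a ground-truth segmentation, neither of which is specified here.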
