Text-guided Visual Prompt DINO for Generic Segmentation

Yuchen Guan; Chong Sun; Canmiao Fu; Zhipeng Huang; Chun Yuan; Chen Li

arXiv:2508.06146·cs.CV·August 11, 2025

TL;DR

Prompt-DINO is a multimodal segmentation framework that combines early fusion, order-aligned query selection, and a large-scale synthetic data engine, achieving state-of-the-art results on open-world detection benchmarks.

Contribution

The paper presents a framework for open-world segmentation that combines early cross-modal fusion of text and visual prompts with backbone features, order-aligned query selection for DETR-based architectures, and a generative data engine for synthesizing large-scale training data.

Findings

01. Achieves state-of-the-art performance on open-world detection benchmarks.

02. Reduces label noise by 80.5% using synthetic data.

03. Expands semantic coverage beyond fixed vocabularies.

Abstract

Recent advancements in multimodal vision models have highlighted the limitations of late-stage feature fusion and suboptimal query selection in hybrid-prompt open-world segmentation, alongside the constraints of caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which…
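
The early fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "early fusion" means concatenating text-prompt, visual-prompt, and image tokens into one sequence before attention is applied, and it uses a single attention head with identity projections in place of learned weights. The function names `self_attention` and `early_fusion` and the toy 2-dimensional tokens are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: a real model would use learned projections and many heads;
    identity projections keep the sketch dependency-free.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all tokens.
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def early_fusion(text_tokens, visual_prompt_tokens, image_tokens):
    """Fuse prompts and image features in one joint attention pass.

    Assumption: fusion is modeled as concatenation followed by
    self-attention over the joint sequence.
    """
    joint = text_tokens + visual_prompt_tokens + image_tokens
    fused = self_attention(joint)
    n_text = len(text_tokens)
    n_vis = len(visual_prompt_tokens)
    # Split the fused sequence back into its three groups.
    return (fused[:n_text],
            fused[n_text:n_text + n_vis],
            fused[n_text + n_vis:])


if __name__ == "__main__":
    text = [[1.0, 0.0]]                 # toy text-prompt token
    visual = [[0.0, 1.0]]               # toy visual-prompt token
    image = [[1.0, 1.0], [0.5, -0.5]]   # toy backbone feature tokens
    fused_text, fused_visual, fused_image = early_fusion(text, visual, image)
    print(fused_image)
```

Because attention runs over the joint sequence, each image token's output already depends on the prompts. A late-fusion design would instead encode the image tokens alone and combine them with the prompts at a later stage.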

Taxonomy

Topics: Handwritten Text Recognition Techniques