Text-guided Visual Prompt DINO for Generic Segmentation
Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, Chen Li

TL;DR
Prompt-DINO is a multimodal segmentation framework that combines early fusion, order-aligned queries, and a large-scale synthetic data engine, achieving state-of-the-art results on open-world detection tasks.
Contribution
The paper presents a framework that combines early cross-modal fusion, order-aligned query selection, and a large-scale synthetic data generation method to improve open-world segmentation.
Findings
Achieves state-of-the-art performance on open-world detection benchmarks.
Reduces label noise by 80.5% using synthetic data.
Expands semantic coverage beyond fixed vocabularies.
Abstract
Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which…
Taxonomy
Topics: Handwritten Text Recognition Techniques
