PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Weifu Fu; Jinyang Li; Bin-Bin Gao; Jialin Li; Yuhuan Lin; Hanqiu Deng; Wenbing Tao; Yong Liu; Chengjie Wang

arXiv:2604.00503·cs.CV·April 8, 2026

TL;DR

PET-DINO is a universal object detector that accepts both text and visual prompts. It combines prompt-enriched training with alignment-friendly visual prompt generation to improve zero-shot detection across diverse scenarios.

Contribution

The paper proposes a unified detection framework with prompt-enriched training strategies, reducing development complexity and improving zero-shot detection performance.

Findings

01 PET-DINO achieves competitive zero-shot detection results.

02 The proposed training strategies enable effective modeling of multiple prompts.

03 The framework supports both text and visual prompts for versatile detection.

Abstract

Open-Set Object Detection (OSOD) enables recognition of novel categories beyond a fixed set of classes, but it faces two challenges: aligning text representations with complex visual concepts, and the scarcity of image-text pairs for rare categories. These result in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the…

Code & Models

Repositories

https://fuweifuvtoo.github.io/pet-dino (GitHub)

Models

fuweifu/PET-DINO (Hugging Face model)
