DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren

TL;DR
DINO-R1 uses reinforcement learning-style training to incentivize reasoning in vision foundation models. It optimizes query-based representation models with group-normalized, query-level rewards and outperforms supervised baselines on COCO, LVIS, and ODinW.
Contribution
It is the first attempt to use reinforcement learning to incentivize visual in-context reasoning in vision foundation models. It introduces Group Relative Query Optimization (GRQO), together with stabilizing techniques, to improve visual reasoning performance.
Findings
Outperforms supervised baselines on COCO, LVIS, and ODinW datasets.
Achieves strong generalization in open-vocabulary and closed-set scenarios.
Demonstrates the effectiveness of reinforcement learning for reasoning in vision models.
Abstract
The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply…
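The group-normalization idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact GRQO formula, so this example assumes the GRPO-style form, in which each reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` parameter, and the sample alignment scores are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each query's reward against its group (assumed GRPO-style).

    rewards: per-query alignment scores for one group of queries.
    Returns one advantage per query: (reward - group mean) / (group std + eps).
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical alignment scores for four queries on the same image.
rewards = [0.9, 0.5, 0.1, 0.5]
advantages = group_relative_advantages(rewards)

# Queries above the group mean get a positive advantage, those below a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Under this assumption, a query is rewarded relative to its peers rather than by its absolute score, so the learning signal stays on a comparable scale across images with very different score ranges.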
Taxonomy
Topics: Auction Theory and Applications
Methods: Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer · self-DIstillation with NO labels
