DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Chenbin Pan; Wenbin He; Zhengzhong Tu; Liu Ren

arXiv:2505.24025·cs.CV·August 4, 2025

Open Access

TL;DR

DINO-R1 uses reinforcement-style training to incentivize reasoning in vision foundation models. Its query-level optimization strategy, GRQO, outperforms supervised baselines on COCO, LVIS, and ODinW.

Contribution

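The authors present DINO-R1 as the first attempt to use reinforcement learning to incentivize visual in-context reasoning in vision foundation models. It introduces Group Relative Query Optimization (GRQO), a reinforcement-style training strategy for query-based representation models, together with stabilizing techniques.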

Findings

01. Outperforms supervised baselines on the COCO, LVIS, and ODinW datasets.

02. Generalizes well in both open-vocabulary and closed-set scenarios.

03. Demonstrates the effectiveness of reinforcement learning for reasoning in vision models.

Abstract

The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply…
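
The phrase "group-normalized" can be illustrated with a minimal sketch. The abstract as shown here does not give the reward definition or the normalization formula, so the code below assumes GRPO-style standardization (subtract the group mean, divide by the group standard deviation). The function name `group_normalized_advantages`, the `eps` parameter, and the example reward values are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize per-query rewards within one group.

    Assumption: GRPO-style normalization, (r - mean) / (std + eps).
    The paper's exact formulation may differ.
    """
    mean = fmean(rewards)
    std = pstdev(rewards, mu=mean)  # population std over the group
    # eps keeps the division defined when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical alignment-quality rewards for four queries on one image.
rewards = [0.9, 0.5, 0.1, 0.5]
advantages = group_normalized_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Under this assumption, queries that align better than the group average receive a positive signal and those that align worse receive a negative one, so the learning signal depends on relative rather than absolute reward.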

Taxonomy

Topics: Auction Theory and Applications

Methods: Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer · self-DIstillation with NO labels