DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, Yong Man Ro

TL;DR
This paper introduces DIP-R1, an RL-based framework that significantly improves the fine-grained visual perception of multimodal large language models in complex, real-world scenes by guiding detailed scene inspection and uncertainty exploration.
Contribution
The paper presents a novel RL framework, DIP-R1, that enhances MLLMs' visual perception through rule-based rewards for scene comprehension, uncertainty exploration, and decision accuracy.
Findings
DIP-R1 outperforms existing baselines in complex scene understanding.
It improves MLLMs' ability to reason about ambiguous regions.
Significant gains are observed in both in-domain and out-of-domain scenarios.
Abstract
MLLMs have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of RL in both LLMs and MLLMs, in this paper we explore how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and closely inspecting visual instances. DIP-R1 guides MLLMs through a detailed inspection of the visual scene via three simply designed rule-based rewards. First, we adopt a standard reasoning reward that encourages the model to include a three-step reasoning process: 1) comprehending the entire visual scene, 2) observing for looking through interested but ambiguous…
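As a rough illustration of what a rule-based reasoning reward could look like, the sketch below returns 1.0 only when a response contains three reasoning steps, each exactly once and in order. It is not the authors' implementation: the tag names (`think`, `observe`, `answer`), the binary scoring, and the function name `format_reward` are assumptions made for this example, since the abstract names the steps but not their markup or scoring.

```python
import re

# Hypothetical tag names: the abstract describes three reasoning steps
# but does not specify how they are marked up in the model output.
STEPS = ("think", "observe", "answer")


def format_reward(response: str) -> float:
    """Return 1.0 if each step appears exactly once and in order, else 0.0."""
    # Build a pattern like <think>...</think> <observe>...</observe> <answer>...</answer>
    body = r"\s*".join(rf"<{tag}>(.+?)</{tag}>" for tag in STEPS)
    if re.fullmatch(rf"\s*{body}\s*", response, flags=re.DOTALL) is None:
        return 0.0
    # Reject responses that repeat a step inside another step's content.
    for tag in STEPS:
        if response.count(f"<{tag}>") != 1 or response.count(f"</{tag}>") != 1:
            return 0.0
    return 1.0
```

A binary rule of this kind is cheap to evaluate during RL training and needs no learned reward model; a real system would combine it with the other rewards listed in the Contribution section above.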
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: ADaptive gradient method with the OPTimal convergence rate
