Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao

TL;DR
This paper introduces HART, a reinforcement learning-based framework that enables large multimodal models to reason with high-resolution images without needing costly visual annotations, improving accuracy and explainability.
Contribution
HART is a novel closed-loop, annotation-free method that enhances high-resolution visual reasoning in multimodal models through self-verification and reinforcement learning.
Findings
HART outperforms baseline models on multiple high-resolution visual reasoning benchmarks.
HART provides explainable reasoning pathways and improves localization accuracy.
HART reduces reliance on costly human-annotated grounding labels.
Abstract
Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training…
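The abstract's scaling claim can be made concrete with a minimal sketch. It assumes a ViT-style encoder that splits the image into non-overlapping square patches and emits one token per patch. The patch size of 14 and the function name `num_image_tokens` are illustrative assumptions, not details taken from the paper.

```python
def num_image_tokens(width: int, height: int, patch_size: int = 14) -> int:
    """Count the patches (tokens) needed to cover a width x height image.

    Assumes non-overlapping square patches; patch_size=14 is a hypothetical
    default chosen for illustration only.
    """
    cols = -(-width // patch_size)   # ceiling division
    rows = -(-height // patch_size)  # ceiling division
    return cols * rows


full = num_image_tokens(448, 448)     # baseline resolution
doubled = num_image_tokens(896, 896)  # side length doubled
crop = num_image_tokens(224, 224)     # a small key region at native scale

print(full, doubled, crop)
```

Under these assumptions, doubling the side length multiplies the token count by four, while a cropped key region costs only a fraction of the full image. That gap is the motivation for reasoning over selected regions rather than the entire high-resolution input.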
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications
