Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

Jiacheng Yang; Anqi Chen; Yunkai Dang; Qi Fan; Cong Wang; Wenbin Li; Feng Miao; Yang Gao

arXiv:2602.23615 · cs.CV · March 10, 2026

Open Access

TL;DR

This paper introduces HART, a reinforcement learning-based framework that enables large multimodal models to reason with high-resolution images without needing costly visual annotations, improving accuracy and explainability.

Contribution

HART is a novel closed-loop, annotation-free method that enhances high-resolution visual reasoning in multimodal models through self-verification and reinforcement learning.

Findings

01. HART outperforms baseline models on multiple high-resolution visual reasoning benchmarks.

02. HART provides explainable reasoning pathways and improves localization accuracy.

03. HART reduces reliance on costly human-annotated grounding labels.

Abstract

Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training…
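The abstract's point about token growth can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a ViT-style encoder that cuts the image into square patches with no token merging, and the patch size of 14 and the helper name `num_patch_tokens` are hypothetical choices. Under those assumptions, doubling each side of the image quadruples the token count, while a small crop of a key region costs only a fraction of the full-resolution budget.

```python
import math


def num_patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Count image tokens for a hypothetical ViT-style encoder.

    Assumes one token per patch x patch tile and no token merging;
    partial tiles at the border are padded up to a full tile.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


# Doubling the side length quadruples the number of tokens.
low_res = num_patch_tokens(448, 448)
high_res = num_patch_tokens(896, 896)

# A 224 x 224 crop of a key region, encoded on its own, is far cheaper
# than encoding the whole high-resolution image.
crop = num_patch_tokens(224, 224)

print(low_res, high_res, high_res / low_res)
print(crop, crop / high_res)
```

With these assumed numbers, the 896 x 896 image needs four times the tokens of the 448 x 448 one, and the 224 x 224 crop uses one sixteenth of the full high-resolution token count, which is the motivation for attending to key regions rather than the entire image.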

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications