TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning

Hao Ding; Zhichuan Yang; Weijie Ge; Ziqin Gao; Chaoyi Lu; Lei Zhao

arXiv:2602.14482·cs.CV·March 12, 2026

Open Access

TL;DR

TikArt introduces an aperture-guided agent that enhances fine-grained visual reasoning in multimodal models by sequentially focusing on regions of interest, improving interpretability and accuracy in complex visual tasks.

Contribution

The paper proposes an aperture-guided reasoning framework built around a Think-Aperture-Observe (TAO) loop, trained with reinforcement learning and a new reward intended to stabilize long-horizon evidence collection.

Findings

01. Significant improvements in high-resolution reasoning tasks.

02. Enhanced performance in multimodal understanding and segmentation.

03. Effective transfer to pixel-level grounding tasks.

Abstract

Fine-grained visual reasoning in multimodal large language models (MLLMs) is bottlenecked by single-pass global image encoding: key evidence often lies in tiny objects, cluttered regions, subtle markings, or dense charts. We present TikArt (Thinking Aperture), an aperture-guided agent that formulates multimodal reasoning as sequential evidence acquisition over regions of interest. TikArt follows a Think-Aperture-Observe (TAO) loop that interleaves language reasoning with two aperture actions: Zoom, which extracts rectangular crops, and Segment, which invokes an off-the-shelf segmenter to produce object-centric mask-based views for irregular targets. A mandatory Observation step after every aperture action writes local evidence back into text, yielding interpretable aperture trajectories and persistent linguistic memory. Built on…
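To make the control flow of the TAO loop concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Action`, `tao_loop`, `scripted_think`, `toy_segmenter`, and `toy_observe` are hypothetical, the image is a toy grid of integers, and the policy, segmenter, and observer are hand-written stand-ins for the MLLM and the off-the-shelf segmenter mentioned in the abstract. The sketch only illustrates the structure the abstract describes: think, apply a Zoom or Segment aperture action, then write an observation back into a text memory.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Action:
    kind: str                                      # "zoom", "segment", or "answer"
    box: Tuple[int, int, int, int] = (0, 0, 0, 0)  # (x0, y0, x1, y1), used by zoom
    point: Tuple[int, int] = (0, 0)                # (x, y) seed pixel, used by segment
    text: str = ""                                 # final answer text


def zoom(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Zoom: extract a rectangular crop."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def segment_view(image: Image, mask: List[List[bool]]) -> Image:
    """Segment: keep pixels inside the mask, zero out everything else."""
    return [[p if m else 0 for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def tao_loop(image: Image, question: str,
             think: Callable[[List[str]], Action],
             segmenter: Callable[[Image, Tuple[int, int]], List[List[bool]]],
             observe: Callable[[Image], str],
             max_steps: int = 4):
    memory = [f"Question: {question}"]   # persistent linguistic memory
    trajectory: List[Action] = []        # interpretable aperture trajectory
    for _ in range(max_steps):
        action = think(memory)           # Think
        if action.kind == "answer":
            return action.text, trajectory, memory
        if action.kind == "zoom":        # Aperture
            view = zoom(image, action.box)
        elif action.kind == "segment":
            view = segment_view(image, segmenter(image, action.point))
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
        trajectory.append(action)
        # Observe: mandatory after every aperture action
        memory.append("Observation: " + observe(view))
    return "", trajectory, memory        # step budget exhausted without an answer


# --- Hypothetical stand-ins for the model and the segmenter (assumptions) ---
def scripted_think(memory: List[str]) -> Action:
    n_obs = sum(1 for m in memory if m.startswith("Observation:"))
    if n_obs == 0:
        return Action("zoom", box=(1, 1, 3, 3))
    if n_obs == 1:
        return Action("segment", point=(1, 1))
    return Action("answer", text=memory[-1])


def toy_segmenter(image: Image, point: Tuple[int, int]) -> List[List[bool]]:
    x, y = point
    return [[p == image[y][x] for p in row] for row in image]


def toy_observe(view: Image) -> str:
    nonzero = sum(1 for row in view for p in row if p != 0)
    return f"{nonzero} nonzero pixels, max value {max(max(row) for row in view)}"


image = [[1, 1, 1, 1],
         [1, 9, 9, 1],
         [1, 9, 9, 1],
         [1, 1, 1, 1]]
answer, trajectory, memory = tao_loop(
    image, "What is the bright object?", scripted_think, toy_segmenter, toy_observe)
```

In this toy run the scripted policy zooms once, segments once, and then answers from the last observation, so `trajectory` holds two aperture actions and `memory` holds the question plus two observations. In the actual system the `think` and `observe` roles would be played by the MLLM itself; how the reinforcement-learning reward scores such trajectories is not specified in the text above and is not modeled here.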

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications