Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context

Anatole Jacquin de Margerie; Alexis Roger; Irina Rish

arXiv:2512.11167·cs.CV·December 15, 2025

TL;DR

This paper reproduces and analyzes the Monkey Vision-Language Model's image tiling approach for high-resolution image understanding, confirming its effectiveness and exploring the impact of global context inclusion on performance.

Contribution

It provides a detailed reproduction of the Monkey VLM's tiling method and extends the analysis by investigating the role of global context in high-resolution multimodal models.

Findings

01 Tiling recovers local visual details effectively.

02 Including global context influences model performance.

03 Results vary depending on task and tile size.

Abstract

Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b), published at CVPR 2024, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including the global context, which provides practical insights…
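To make the tiling idea in the abstract concrete, here is a minimal sketch of splitting an image into tiles and optionally appending a coarse global view. It is an illustration under stated assumptions, not the Monkey pipeline or the reproduction's code: the function name `tile_image`, the `tile_size` and `include_global` parameters, the zero-padding of edges, and the nearest-neighbour downsampling are all hypothetical choices made for this example.

```python
import numpy as np


def tile_image(image, tile_size, include_global=True):
    """Split an H x W x C image into non-overlapping square tiles.

    Hypothetical helper for illustration only. Returns a list of arrays of
    shape (tile_size, tile_size, C). If include_global is True, a coarse
    view of the whole image, resized to the tile shape, is appended last.
    """
    height, width, _ = image.shape

    # Assumption: zero-pad the bottom and right edges so that both
    # dimensions are exact multiples of the tile size.
    pad_h = (-height) % tile_size
    pad_w = (-width) % tile_size
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))

    rows = padded.shape[0] // tile_size
    cols = padded.shape[1] // tile_size

    # Local detail: each tile keeps the original pixel resolution.
    tiles = [
        padded[r * tile_size:(r + 1) * tile_size,
               c * tile_size:(c + 1) * tile_size]
        for r in range(rows)
        for c in range(cols)
    ]

    if include_global:
        # Global context: nearest-neighbour resize of the unpadded image
        # down (or up) to a single tile-sized view.
        ys = np.arange(tile_size) * height // tile_size
        xs = np.arange(tile_size) * width // tile_size
        tiles.append(image[ys][:, xs])

    return tiles
```

For a 900 x 1300 image and an illustrative tile size of 448, this sketch produces a 3 x 3 grid of local tiles plus one global view, so the number of inputs to the vision encoder grows with resolution and shrinks with tile size. Setting `include_global=False` gives the tiles-only variant, which is the kind of comparison the global-context study implies.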

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning