From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Masanari Oi; Koki Maeda; Ryuto Koike; Daisuke Oba; Nakamasa Inoue; Naoaki Okazaki

arXiv:2602.08735 · cs.CV · February 11, 2026

Open Access

TL;DR

This paper introduces HATCH, a training framework for multi-modal large language models that improves multi-image spatial reasoning by explicitly supervising two mechanisms: cross-view correspondence and stepwise viewpoint transformation.

Contribution

The paper proposes a training framework with two complementary objectives that explicitly supervise cross-view correspondence and viewpoint change, mechanisms that existing studies incorporate only partially and often implicitly.

Findings

01 HATCH outperforms comparable models on three benchmarks.

02 It achieves results competitive with larger models.

03 The framework preserves single-image reasoning capabilities.

Abstract

While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align…
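The idea of stepwise viewpoint transformation, composing relative viewpoint changes one after another, can be illustrated with a small geometric sketch. This is not the HATCH training objective; it is a minimal example that assumes each view is a 2D pose (x forward, y left, heading in radians) and that each relative change is expressed in the frame of the previous view. The names `Pose`, `compose`, and `chain` are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable, Tuple

# Hypothetical 2D pose: (x, y, heading in radians), x forward, y left.
Pose = Tuple[float, float, float]


def compose(a: Pose, b: Pose) -> Pose:
    """Apply relative change b, expressed in a's frame, on top of pose a."""
    ax, ay, at = a
    bx, by, bt = b
    # Rotate b's translation into the frame a is expressed in, then add it.
    x = ax + math.cos(at) * bx - math.sin(at) * by
    y = ay + math.sin(at) * bx + math.cos(at) * by
    # Wrap the summed heading into [-pi, pi].
    t = math.atan2(math.sin(at + bt), math.cos(at + bt))
    return (x, y, t)


def chain(steps: Iterable[Pose]) -> Pose:
    """Compose a sequence of relative viewpoint changes, starting at view 1."""
    pose: Pose = (0.0, 0.0, 0.0)
    for step in steps:
        pose = compose(pose, step)
    return pose


# View 1 -> view 2: move one unit forward, then turn 90 degrees left.
# View 2 -> view 3: the same change again, now relative to view 2.
steps = [(1.0, 0.0, math.pi / 2), (1.0, 0.0, math.pi / 2)]
total = chain(steps)
print(total)
```

Under these assumptions, view 3 ends up roughly one unit forward and one unit to the left of view 1, facing the opposite direction. The point of the sketch is only that the view 1 to view 3 relation is obtained by chaining the two pairwise changes rather than being estimated in a single step.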

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation