Token Warping Helps MLLMs Look from Nearby Viewpoints

Phillip Y. Lee; Chanho Park; Mingue Park; Seungwoo Yoo; Juil Koo; Minhyuk Sung

arXiv:2604.02870·cs.CV·April 6, 2026

TL;DR

This paper introduces token warping for multimodal large language models (MLLMs): image tokens, rather than pixels, are warped to a nearby viewpoint. Backward token warping makes the models more robust to viewpoint changes and improves their understanding of how a scene appears from nearby viewpoints.

Contribution

The paper proposes token warping for viewpoint changes in MLLMs and shows that the backward variant outperforms pixel-wise warping and other baselines on viewpoint reasoning tasks.

Findings

01

Backward token warping outperforms pixel-wise warping.

02

Token warping improves viewpoint robustness in MLLMs.

03

On ViewBench, token warping achieves higher reasoning accuracy than pixel-wise warping and the other baselines.

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed…
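
The backward warping described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `backward_token_warp`, the pinhole intrinsics `K`, the target-to-source pose `(R, t)`, the per-grid-point target depth `tgt_depth`, the patch size, and the nearest-neighbor lookup are all hypothetical choices made for this example.

```python
import numpy as np

def backward_token_warp(src_tokens, tgt_depth, K, R, t, patch=14):
    """Hypothetical sketch of backward token warping (not the paper's code).

    src_tokens: (H, W, C) grid of source-view image tokens.
    tgt_depth:  (H, W) assumed depth at each target grid point (patch center).
    K:          (3, 3) pinhole intrinsics in pixel units (assumed shared by both views).
    R, t:       rotation (3, 3) and translation (3,) mapping target-camera
                coordinates to source-camera coordinates (assumed known).
    Returns the warped (H, W, C) token grid and an (H, W) validity mask.
    """
    H, W, C = src_tokens.shape

    # Dense grid on the target view: pixel coordinates of each patch center.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    u = (xs + 0.5) * patch
    v = (ys + 0.5) * patch
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Unproject each grid point to 3D in the target camera frame.
    rays = pix @ np.linalg.inv(K).T
    pts_tgt = rays * tgt_depth.reshape(-1, 1)

    # Move the points into the source camera frame and project them.
    pts_src = pts_tgt @ R.T + t
    z = pts_src[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)  # avoid dividing by non-positive depth
    proj = pts_src @ K.T
    us = proj[:, 0] / safe_z
    vs = proj[:, 1] / safe_z

    # Retrieve the source token whose patch contains the projected point.
    col = np.floor(us / patch).astype(int)
    row = np.floor(vs / patch).astype(int)
    valid = (z > 1e-6) & (col >= 0) & (col < W) & (row >= 0) & (row < H)

    # Grid points that fall outside the source view keep a zero token.
    out = np.zeros((H * W, C), dtype=src_tokens.dtype)
    out[valid] = src_tokens[row[valid], col[valid]]
    return out.reshape(H, W, C), valid.reshape(H, W)
```

In this sketch every target grid point fetches exactly one source token, so the output stays a regular grid; grid points whose source location falls outside the image are flagged in the returned mask rather than filled. How the target-view geometry is obtained and how the lookup is performed in the actual method are left to the paper and its repository.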

Code & Models

Repositories

kaist-visual-ai-group/Token-Warping-MLLM (GitHub)
