Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Xuejun Zhang; Aditi Tiwari; Zhenhailong Wang; Heng Ji

arXiv:2602.06041 · cs.CV · February 9, 2026

Open Access

TL;DR

This paper introduces CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor to improve multi-view spatial reasoning in multimodal large language models (MLLMs). It grounds a natural-language viewpoint description to a target camera pose and reasons from that viewpoint.

Contribution

CAMCUE is presented as the first pose-aware multi-image model that explicitly incorporates camera pose for cross-view fusion and novel-view reasoning. It injects per-view pose into visual tokens and synthesizes a pose-conditioned imagined target view to support answering.

Findings

01. Achieves over 90% rotation accuracy at a 20-degree error threshold.

02. Reduces inference time from 256.6 s to 1.45 s per example (roughly 177× faster).

03. Improves overall accuracy by 9.06%.
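
As a rough illustration of what "rotation accuracy within 20 degrees" can mean, the sketch below scores predicted camera rotations against a ground-truth rotation. This page does not define the metric, so the geodesic angle between rotation matrices is an assumption here, and the helper names (`rotation_error_deg`, `accuracy_within`, `yaw`) and the toy data are hypothetical rather than taken from the paper.

```python
import math

def rotation_error_deg(R_pred, R_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists).

    Assumed metric: theta = arccos((trace(R_pred^T R_true) - 1) / 2).
    """
    # trace(R_pred^T R_true) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def accuracy_within(errors_deg, threshold_deg=20.0):
    """Fraction of examples whose rotation error is at most the threshold."""
    return sum(e <= threshold_deg for e in errors_deg) / len(errors_deg)

def yaw(deg):
    """Rotation about the z-axis by `deg` degrees (toy data generator)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Toy example: one ground-truth rotation and three hypothetical predictions.
truth = yaw(30.0)
predictions = [yaw(35.0), yaw(45.0), yaw(80.0)]
errors = [rotation_error_deg(p, truth) for p in predictions]
rotation_acc_20 = accuracy_within(errors, threshold_deg=20.0)
print(errors, rotation_acc_20)
```

In this toy setup the three predictions are off by 5, 15, and 50 degrees, so two of the three count as correct at the 20-degree threshold. The paper's own evaluation protocol may differ.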

Abstract

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI