Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction

Junyi Chen; Di Huang; Weicai Ye; Wanli Ouyang; Tong He

arXiv:2410.18962·cs.CV·October 25, 2024

TL;DR

This paper introduces GST, a novel auto-regressive framework that jointly performs spatial localization and view prediction, improving spatial reasoning capabilities in machines by capturing the interconnected nature of these tasks.

Contribution

The paper proposes a unified auto-regressive model with a new camera tokenization method that jointly estimates camera pose and predicts views, bridging spatial awareness and visual prediction.
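To make the idea of camera tokenization concrete, here is a minimal sketch of one way a continuous camera pose could be turned into discrete tokens and placed in the same sequence as image tokens. This is not the paper's tokenizer, which this summary does not describe. Uniform binning, the vocabulary sizes, the normalized pose range, and the function names `quantize_pose`, `dequantize_pose`, and `build_sequence` are all assumptions made for illustration.

```python
from typing import List, Sequence

# All constants below are hypothetical and chosen only for illustration.
IMAGE_VOCAB = 1024          # assumed size of an image-token codebook
POSE_BINS = 256             # assumed number of bins per pose component
POSE_RANGE = (-1.0, 1.0)    # assumed: pose components are pre-normalized to this range


def quantize_pose(pose: Sequence[float]) -> List[int]:
    """Map each continuous pose component to a token id above the image vocabulary."""
    lo, hi = POSE_RANGE
    tokens = []
    for value in pose:
        clipped = min(max(value, lo), hi)
        # Uniform binning; the top edge of the range falls into the last bin.
        bin_index = min(int((clipped - lo) / (hi - lo) * POSE_BINS), POSE_BINS - 1)
        tokens.append(IMAGE_VOCAB + bin_index)
    return tokens


def dequantize_pose(tokens: Sequence[int]) -> List[float]:
    """Recover the bin-centre value for each camera token."""
    lo, hi = POSE_RANGE
    width = (hi - lo) / POSE_BINS
    return [lo + (t - IMAGE_VOCAB + 0.5) * width for t in tokens]


def build_sequence(
    observed: Sequence[int], pose: Sequence[float], target: Sequence[int]
) -> List[int]:
    """Concatenate observed-image tokens, camera tokens, and target-image tokens."""
    return list(observed) + quantize_pose(pose) + list(target)
```

With pose and image tokens in one vocabulary, a single auto-regressive model could in principle be trained to continue a sequence with either camera tokens (localization) or image tokens (view prediction). The ordering used in `build_sequence` is one possible layout, not one confirmed by the source.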

Findings

1. Joint training improves both pose estimation and view prediction accuracy.
2. The model effectively captures the relationship between spatial location and visual perspective.
3. The approach outperforms models trained on each task separately.

Abstract

Spatial intelligence is the ability of a machine to perceive, reason, and act in three dimensions within space and time. Recent advancements in large-scale auto-regressive models have demonstrated remarkable capabilities across various reasoning tasks. However, these models often struggle with fundamental aspects of spatial reasoning, particularly in answering questions like "Where am I?" and "What will I see?". Although some attempts have been made, existing approaches typically treat these questions as separate tasks, failing to capture their interconnected nature. In this paper, we present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction. Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Attention Model · Byte Pair Encoding