LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

Shuai Wang; Daoan Zhang; Tianyi Bai; Shitong Shao; Jiebo Luo; Jiaheng Wei

arXiv:2511.19261 · cs.CV · November 25, 2025

Open Access

TL;DR

LAST introduces a novel approach for general vision-language models to jointly understand 3D space and long videos by enabling visual thinking trajectories, significantly improving performance in spatial, video, and image understanding tasks.

Contribution

The paper presents LAST, a method that enhances VLMs to think in space and time using only 2D images, unifying 3D and video understanding without specialized architectures.

Findings

01. 15.8% gains on EgoSchema in the zero-shot setting

02. 8.3 gains on VSI-Bench over Qwen2.5-VL-7B

03. Substantial improvements across multiple benchmarks

Abstract

Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance on 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications