ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Wei Li; Jizhihui Liu; Li Yixing; Junwen Tong; Rui Shao; Liqiang Nie

arXiv:2605.05126·cs.RO·May 7, 2026

TL;DR

ConsisVLA-4D introduces a unified framework that significantly improves spatiotemporal perception and reasoning in robotic manipulation, achieving higher accuracy and faster inference by ensuring consistency across views, objects, and scenes.

Contribution

The paper proposes novel modules that enforce cross-view, cross-object, and cross-scene consistency, advancing efficient 3D perception and 4D reasoning in vision-language-action models.

Findings

01. Achieves 21.6% and 41.5% performance improvements on benchmarks and real-world tasks, respectively.

02. Provides 2.3-fold and 2.4-fold inference speedups over previous models.

03. Demonstrates enhanced spatiotemporal consistency in robotic perception and reasoning.

Abstract

Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions, but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by…
