Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu; Xilin Wang; Yixuan Li; Zhuofan Zhang; Xiaojian Ma; Yixin Chen; Baoxiong Jia; Wei Liang; Qian Yu; Zhidong Deng; Siyuan Huang; Qing Li

arXiv:2507.04047 · cs.CV · July 31, 2025

TL;DR

This paper introduces MTU3D, a unified framework that combines active perception and 3D vision-language learning, enabling embodied agents to explore and understand environments more effectively without explicit 3D reconstruction.

Contribution

The paper proposes a novel online query-based representation learning method and a unified objective for grounding and exploration, advancing embodied scene understanding.

Findings

1. Outperforms state-of-the-art methods by 14-23% in success rate across benchmarks.
2. Enables navigation with diverse input modalities including language and images.
3. Achieves end-to-end trajectory learning over a million trajectories.

Abstract

Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames,…
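
To make the idea of building a spatial memory directly from RGB-D frames more concrete, here is a minimal Python sketch. It is not the paper's method: the names `MemoryEntry`, `backproject`, `update_memory`, and the `merge_radius` parameter are hypothetical, the merge rule (same label within a fixed distance) is an assumption chosen for simplicity, and points stay in the camera frame, whereas a real system would also apply the camera pose to reach a shared world frame.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One remembered object (hypothetical structure, not from the paper)."""
    label: str
    center: tuple  # (x, y, z) in metres, camera frame in this sketch
    hits: int = 1  # number of observations merged into this entry


def backproject(u, v, depth, fx, fy, cx, cy):
    """Standard pinhole back-projection of pixel (u, v) with metric depth."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def update_memory(memory, label, center, merge_radius=0.5):
    """Merge an observation into a nearby same-label entry, else append it.

    merge_radius (metres) is an assumed threshold for illustration only.
    """
    for entry in memory:
        if entry.label != label:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(entry.center, center)) ** 0.5
        if dist <= merge_radius:
            n = entry.hits
            # Running mean of all observed centers for this entry.
            entry.center = tuple(
                (a * n + b) / (n + 1) for a, b in zip(entry.center, center)
            )
            entry.hits = n + 1
            return entry
    entry = MemoryEntry(label, center)
    memory.append(entry)
    return entry


memory = []
# A detection at the image center, 2 m away (assumed intrinsics).
point = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
update_memory(memory, "chair", point)
update_memory(memory, "chair", (0.2, 0.0, 2.0))  # close: merged
update_memory(memory, "chair", (3.0, 0.0, 2.0))  # far: new entry
```

The point of the sketch is that the memory grows incrementally as frames arrive, so no mesh or point-cloud reconstruction of the full scene is needed before objects can be looked up.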

Code & Models

Models

🤗 bigai/MTU3D
model · ♡ 1

Datasets

bigai/MTU3D
dataset · 418 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation