Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations

Zhihao Yuan; Shuyi Jiang; Chun-Mei Feng; Yaolun Zhang; Shuguang Cui; Zhen Li; Na Zhao

arXiv:2506.17545·cs.CV·June 24, 2025

TL;DR

Scene-R1 is a framework for reasoning about 3D scenes from video without dense 3D annotations. It combines reinforcement-learning-driven reasoning with a two-stage grounding pipeline (temporal grounding, then image grounding), so each prediction comes with a visible rationale.

Contribution

A video-grounded approach that removes the dependence on pre-trained 3D detectors and point-wise 3D instance annotations, and exposes the reasoning behind each prediction.

Findings

01

Outperforms existing open-vocabulary baselines on multiple datasets.

02

Provides transparent, step-by-step rationales for 3D scene reasoning.

03

Achieves accurate 3D understanding from RGB-D videos without point-wise 3D instance supervision.

Abstract

Large language models are increasingly used to understand the 3D world. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. We then track the object using SAM2 to produce pixel-accurate masks in the RGB frames and project them back into 3D, thereby…
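
The last step described above, projecting 2D masks back into 3D, can be illustrated with a standard pinhole back-projection. This is a minimal sketch and not the authors' implementation: it assumes a per-pixel binary mask (such as one a tracker like SAM2 might produce) and an aligned depth map are already available as nested lists, and that the intrinsics `fx`, `fy`, `cx`, `cy` are known. The function names `backproject_mask` and `axis_aligned_box` are hypothetical, and the camera-to-world transform needed to fuse multiple frames is omitted.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_mask(
    mask: Sequence[Sequence[int]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift masked pixels to camera-frame 3D points with a pinhole model.

    Assumption: mask and depth have the same shape, and depth is in metres
    with 0 marking invalid measurements.
    """
    points: List[Point3D] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if inside and z > 0:  # skip background and invalid depth
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


def axis_aligned_box(points: Sequence[Point3D]) -> Tuple[Point3D, Point3D]:
    """Return the (min corner, max corner) of an axis-aligned 3D box."""
    if not points:
        raise ValueError("no valid 3D points inside the mask")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Toy 3x3 frame: two masked pixels at a constant depth of 2 m.
toy_mask = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]
toy_depth = [[2.0] * 3 for _ in range(3)]
toy_points = backproject_mask(toy_mask, toy_depth, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
toy_box = axis_aligned_box(toy_points)
```

In a multi-frame setting, the points from each frame would first be transformed into a shared world frame using the camera poses before a box is fitted; that step depends on the dataset's pose convention and is left out here.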
