4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Chiao-An Yang; Ryo Hachiuma; Sifei Liu; Subhashree Radhakrishnan; Raymond A. Yeh; Yu-Chiang Frank Wang; Min-Hung Chen

arXiv:2512.17012 · cs.CV · April 14, 2026

TL;DR

This paper introduces 4D-RGPT, a multimodal language model enhanced with perceptual distillation for better 4D understanding of videos, along with a new benchmark for region-level 4D reasoning.

Contribution

The paper makes three contributions: the 4D-RGPT model, the Perceptual 4D Distillation (P4D) training framework, and R4D-Bench, a benchmark for region-level 4D video reasoning.

Findings

01. 4D-RGPT outperforms existing models on 4D VQA benchmarks.
02. Perceptual 4D Distillation improves the model's temporal perception.
03. R4D-Bench enables more detailed region-level 4D reasoning evaluation.

Abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench…
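The abstract describes P4D only at a high level: representations from a frozen expert model are transferred into 4D-RGPT. As a rough illustration of that general idea, the sketch below shows a generic feature-distillation loss in plain Python. It is not the paper's actual objective. The function names, the linear projection, and the mean-squared-error criterion are all assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]


def project(features: Vector, weights: List[Vector]) -> Vector:
    """Linearly map one student feature vector to the teacher's dimension.

    `weights` has one row per teacher dimension (a hypothetical projection head).
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def distillation_loss(
    student: List[Vector],
    teacher: List[Vector],
    weights: List[Vector],
) -> float:
    """Mean squared error between projected student features and teacher features.

    The teacher features are treated as fixed targets (the "frozen expert");
    in real training only the student and the projection would be updated.
    """
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        projected = project(s_vec, weights)
        for p, t in zip(projected, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# One token with a 2-dim student feature, projected to a 3-dim teacher space.
student_feats = [[1.0, 2.0]]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_feats = [[1.0, 2.0, 5.0]]
loss = distillation_loss(student_feats, teacher_feats, projection)
```

In this toy case the projected student vector is `[1.0, 2.0, 3.0]`, so only the last dimension differs from the teacher target and contributes to the loss. A loss like this would typically be added to the usual language-modeling objective, though how the paper combines its terms is not stated in the abstract.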

Code & Models

Repositories

nvlabs/4D-RGPT
github

Datasets

nvidia/R4D-Bench
dataset · 109 dl
