TL;DR
GaussianDWM introduces a unified 3D Gaussian scene representation for driving world models, enabling enhanced scene understanding and multi-modal generation with aligned textual information.
Contribution
It proposes a novel 3D Gaussian scene representation that aligns textual features with 3D scenes and integrates language-guided sampling for improved multi-modal driving environment modeling.
Findings
Achieves state-of-the-art performance on nuScenes and NuInteract datasets.
Effectively aligns textual information with 3D scenes using Gaussian primitives.
Demonstrates improved multi-modal scene generation and understanding.
Abstract
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also providing contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early…
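The idea of attaching linguistic features to each Gaussian primitive and then sampling with language guidance can be illustrated with a minimal sketch. Everything below is a hypothetical illustration and not the paper's implementation: the `LanguageGaussian` class, the `sample_by_language` function, the two-dimensional toy features, and the top-k cosine-similarity rule are all assumptions, since the abstract does not specify how features are stored or how sampling is performed.

```python
import math
from dataclasses import dataclass


@dataclass
class LanguageGaussian:
    """Hypothetical 3D Gaussian primitive that also carries a language feature."""
    mean: tuple          # (x, y, z) center of the Gaussian
    scale: tuple         # per-axis extent
    opacity: float
    text_feature: list   # embedding assumed to live in a text encoder's space


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sample_by_language(gaussians, query_feature, k):
    """Assumed sampling rule: keep the k Gaussians most similar to the query."""
    ranked = sorted(
        gaussians,
        key=lambda g: cosine(g.text_feature, query_feature),
        reverse=True,
    )
    return ranked[:k]


# Toy scene with made-up 2-D features; real embeddings would be much larger.
scene = [
    LanguageGaussian((1.0, 0.0, 0.0), (0.5, 0.5, 0.5), 0.9, [1.0, 0.0]),
    LanguageGaussian((0.0, 2.0, 0.0), (0.3, 0.3, 0.3), 0.8, [0.0, 1.0]),
    LanguageGaussian((4.0, 1.0, 0.0), (0.4, 0.4, 0.4), 0.7, [0.8, 0.2]),
]
selected = sample_by_language(scene, query_feature=[1.0, 0.0], k=2)
```

In this toy setup the query selects the first and third primitives, whose features point in roughly the same direction as the query. A hard top-k cut is only one possible choice; a soft weighting by similarity would be an equally plausible reading of "language-guided sampling".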
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
