TL;DR
HERMES++ is a unified model that combines 3D scene understanding and future geometry prediction for autonomous driving, leveraging LLMs and geometric constraints to improve scene comprehension and prediction accuracy.
Contribution
The paper introduces a framework that integrates a multi-view BEV representation, LLM-enhanced queries, and geometric optimization to unify scene understanding and future prediction in a single model.
Findings
HERMES++ outperforms specialist models in point cloud prediction and scene understanding.
Integrating semantic and geometric information improves accuracy on both tasks.
Code and model will be publicly released for community use.
Abstract
Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced…
