HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

Xin Zhou; Dingkang Liang; Xiwu Chen; Feiyang Tan; Dingyuan Zhang; Hengshuang Zhao; Xiang Bai

arXiv:2604.28196 · cs.CV · May 1, 2026

TL;DR

HERMES++ is a unified model that combines 3D scene understanding and future geometry prediction for autonomous driving, leveraging LLMs and geometric constraints to improve scene comprehension and prediction accuracy.

Contribution

The paper introduces a framework that integrates a multi-view BEV representation, LLM-enhanced queries, and geometric optimization to unify scene understanding and future prediction in a single model.
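To make the idea of a BEV representation "compatible with LLMs" more concrete, the sketch below shows one plausible way a bird's-eye-view feature grid could be turned into a token sequence. It is an illustration only, not the authors' method: the function name `bev_to_tokens`, the average-pooling step, and the `patch` parameter are all assumptions introduced here, and the paper may downsample and project BEV features quite differently.

```python
import numpy as np


def bev_to_tokens(bev: np.ndarray, patch: int = 2) -> np.ndarray:
    """Turn a BEV feature grid of shape (H, W, C) into a token sequence.

    Hypothetical helper: each patch x patch block of BEV cells is
    average-pooled into one token, so the output has shape
    (H // patch * W // patch, C). This is an assumed design, not the
    tokenization described in the paper.
    """
    h, w, c = bev.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("H and W must be divisible by patch")
    # Split each spatial axis into (blocks, cells-per-block), then average
    # over the cells inside every block.
    blocks = bev.reshape(h // patch, patch, w // patch, patch, c)
    pooled = blocks.mean(axis=(1, 3))
    # Flatten the pooled grid row by row into a sequence of tokens.
    return pooled.reshape(-1, c)


# Toy example: a 4 x 4 grid with 3 feature channels becomes 4 tokens.
bev = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
tokens = bev_to_tokens(bev, patch=2)
```

Pooling before flattening keeps the sequence short, which matters because the cost of attention in a language model grows with the number of tokens; the trade-off is coarser spatial detail per token.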

Findings

01 Outperforms specialist models in point cloud prediction and scene understanding.

02 Effective integration of semantic and geometric information enhances accuracy.

03 Code and model will be publicly released for community use.

Abstract

Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced…

Code & Models

Repositories

H-EmbodVis/HERMESV2 (GitHub)