UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, and Ying-Cong Chen

arXiv:2512.09864·cs.CV·December 11, 2025

Open Access

TL;DR

UniUGP is a unified framework that combines scene understanding, future video generation, and trajectory planning to improve autonomous driving in complex, long-tail scenarios by leveraging visual dynamics and semantic reasoning.

Contribution

The paper introduces UniUGP, a novel hybrid architecture that unifies understanding, generation, and planning for autonomous driving, integrating pre-trained models and specialized datasets for enhanced reasoning and decision-making.

Findings

01. Achieves state-of-the-art performance in perception, reasoning, and decision-making.

02. Demonstrates superior generalization to challenging long-tail scenarios.

03. Effectively leverages unlabeled videos and large language models.

Abstract

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces…

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis