DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Zhechao Wang; Yiming Zeng; Lufan Ma; Zeqing Fu; Chen Bai; Ziyao Lin; Cheng Lu

arXiv:2602.22549 · cs.CV · February 27, 2026

TL;DR

DrivePTS is a framework for driving scene generation that combines progressive learning, detailed multi-view textual descriptions, and a frequency-guided structural loss to improve the diversity, fidelity, and controllability of generated scenes.

Contribution

DrivePTS introduces three components: a progressive learning strategy with mutual information constraints, multi-view hierarchical textual guidance, and a frequency-guided structural loss.

Findings

01 Achieves state-of-the-art fidelity and controllability.

02 Successfully generates rare and complex driving scenes.

03 Improves structural and semantic details in generated scenes.

Abstract

Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations.…
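To make the abstract's point about uniform spatial weighting concrete, the following minimal sketch contrasts a standard denoising loss (plain mean squared error over all pixels) with a spatially weighted variant. This is not the frequency-guided loss of DrivePTS, whose details are not given in the text above; the function names, the toy 8x8 "noise map", the foreground box, and the 5x foreground weight are all assumptions chosen for illustration.

```python
import numpy as np


def uniform_denoising_loss(noise_pred, noise_true):
    """Standard denoising objective: MSE with every pixel weighted equally."""
    return float(np.mean((noise_pred - noise_true) ** 2))


def weighted_denoising_loss(noise_pred, noise_true, weight):
    """Hypothetical variant: per-pixel weights, normalised to average 1."""
    w = weight / weight.mean()
    return float(np.mean(w * (noise_pred - noise_true) ** 2))


# Toy setup (assumed): an 8x8 noise target and a prediction that is
# perfect everywhere except inside a small 2x2 "foreground" box.
rng = np.random.default_rng(0)
noise_true = rng.standard_normal((8, 8))
noise_pred = noise_true.copy()
noise_pred[2:4, 2:4] += 1.0

# Assumed weighting: foreground pixels count 5x as much as background.
fg_mask = np.zeros((8, 8))
fg_mask[2:4, 2:4] = 1.0
weight = 1.0 + 4.0 * fg_mask

uniform = uniform_denoising_loss(noise_pred, noise_true)
weighted = weighted_denoising_loss(noise_pred, noise_true, weight)
print(f"uniform: {uniform:.4f}, weighted: {weighted:.4f}")
```

In this toy case the foreground covers only 4 of 64 pixels, so the uniform loss dilutes the error to about 0.0625, while the weighted variant reports about 0.25. That dilution is the effect the abstract attributes to uniform weighting: errors on small foreground objects contribute little to the training signal.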

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis