Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation
Chun-Peng Chang, Chen-Yu Wang, Julian Schmidt, Holger Caesar, Alain Pagani

TL;DR
This paper examines how fine-tuning video generators for driving simulation enhances visual quality but may reduce dynamic accuracy, proposing continual learning as a balanced alternative.
Contribution
It reveals a trade-off between visual fidelity and dynamic accuracy in fine-tuned video models for driving data and suggests continual learning to mitigate this issue.
Findings
Fine-tuning improves visual fidelity but can degrade spatial accuracy in modeling dynamic elements.
Driving scenes' regularity allows models to focus on dominant motion patterns.
Continual learning strategies help preserve dynamic details while maintaining quality.
Abstract
Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called "world models". In this work, we investigate the effects of existing approaches to fine-tuning video generators on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to…
