LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation

Mirlan Karimov; Teodora Spasojevic; Markus Braun; Julian Wiederer; Vasileios Belagiannis; Marc Pollefeys

arXiv:2602.05966·cs.CV·February 6, 2026

Open Access

TL;DR

This paper introduces Localized Semantic Alignment (LSA), a fine-tuning framework that improves temporal consistency in traffic video generation by aligning semantic features, eliminating the need for control signals during inference.

Contribution

We propose LSA, a novel fine-tuning method that enhances temporal consistency in pre-trained video generation models through semantic feature alignment around dynamic objects.

Findings

01 LSA outperforms baselines on standard video generation metrics.

02 The approach improves temporal consistency without additional inference overhead.

03 LSA is effective on the nuScenes and KITTI datasets.

Abstract

Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips, localized around dynamic objects, inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the…
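The localized feature comparison described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the feature extractor, the distance function, or how regions are localized, so the sketch assumes precomputed feature maps of shape (T, H, W, C), axis-aligned boxes around dynamic objects given in feature-map coordinates, and cosine distance as the comparison. The names `boxes_to_mask` and `localized_feature_loss` are hypothetical.

```python
import numpy as np


def boxes_to_mask(boxes, T, H, W):
    """Rasterize per-frame boxes into a binary mask of shape (T, H, W).

    Each box is (t, y0, x0, y1, x1) in feature-map coordinates
    (an assumed format, half-open on the upper bounds).
    """
    mask = np.zeros((T, H, W), dtype=np.float64)
    for t, y0, x0, y1, x1 in boxes:
        mask[t, y0:y1, x0:x1] = 1.0
    return mask


def localized_feature_loss(feat_gt, feat_gen, mask, eps=1e-8):
    """Mean cosine distance between two feature maps over masked cells only.

    feat_gt, feat_gen: arrays of shape (T, H, W, C), e.g. outputs of a
        frozen feature extractor on ground-truth and generated clips.
    mask: array of shape (T, H, W), 1 inside dynamic-object regions.
    Cosine distance is an assumption; the source does not name the metric.
    """
    feat_gt = np.asarray(feat_gt, dtype=np.float64)
    feat_gen = np.asarray(feat_gen, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)

    # Per-cell cosine similarity along the channel axis.
    dot = np.sum(feat_gt * feat_gen, axis=-1)
    norms = np.linalg.norm(feat_gt, axis=-1) * np.linalg.norm(feat_gen, axis=-1)
    cos_dist = 1.0 - dot / (norms + eps)

    n_cells = mask.sum()
    if n_cells == 0:
        # No dynamic objects in the clip: nothing to align.
        return 0.0
    return float((cos_dist * mask).sum() / n_cells)


# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
features_gt = rng.normal(size=(2, 4, 4, 8))
features_gen = features_gt + 0.1 * rng.normal(size=(2, 4, 4, 8))
object_mask = boxes_to_mask([(0, 1, 1, 3, 3)], T=2, H=4, W=4)
loss_value = localized_feature_loss(features_gt, features_gen, object_mask)
```

In a real fine-tuning setup this term would be computed with a differentiable framework so gradients reach the generator; NumPy is used here only to keep the masking logic easy to read. Differences outside the masked regions do not contribute to the value.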

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Tensor decomposition and applications