Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

Xuyang Chen; Conglang Zhang; Chuanheng Fu; Zihao Yang; Kaixuan Zhou; Yizhi Zhang; Jianan He; Yanfeng Zhang; Mingwei Sun; Zengmao Wang; Zhen Dong; Xiaoxiao Long; Liqiu Meng

arXiv:2602.06159·cs.CV·February 10, 2026

Open Access

TL;DR

This paper introduces Driving with DINO (DwD), a framework that uses vision foundation features as the bridge between simulated and real domains, with the aim of improving both realism and control in sim-to-real autonomous driving video generation.

Contribution

The work proposes a unified feature-based approach that combines DINO features with dimensionality reduction, a learnable spatial alignment, and causal temporal aggregation to improve sim-to-real transfer for autonomous driving videos.
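As a rough illustration of two of the ingredients named above, the sketch below reduces the channel dimension of per-patch features and then aggregates them over time without looking at future frames. The source does not say how DwD performs either step, so plain PCA and a causal running mean are used here purely as stand-ins. The random array takes the place of real DINO features, and the names `pca_reduce` and `causal_mean` and all shapes are hypothetical.

```python
import numpy as np

def pca_reduce(features, k):
    """Project (N, C) features onto their top-k principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def causal_mean(frames, window):
    """Average each frame with up to window-1 preceding frames, never future ones."""
    out = np.empty_like(frames)
    for t in range(frames.shape[0]):
        start = max(0, t - window + 1)
        out[t] = frames[start:t + 1].mean(axis=0)
    return out

# Hypothetical stand-in for patch features: T frames, H x W patches, C channels.
# Real DINO features would come from a pretrained backbone; random data is an assumption.
rng = np.random.default_rng(0)
T, H, W, C, K = 4, 6, 8, 32, 3
feats = rng.standard_normal((T, H, W, C))

# Reduce C channels to K per patch, then smooth over time causally.
reduced = pca_reduce(feats.reshape(-1, C), K).reshape(T, H, W, K)
smoothed = causal_mean(reduced, window=2)
```

Because the aggregation at frame `t` only reads frames up to `t`, changing a later frame leaves earlier outputs untouched, which is the property "causal" refers to here.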

Findings

01 Effective domain gap bridging with DINO features

02 Improved control precision and realism in generated videos

03 Enhanced temporal stability and reduced motion blur

Abstract

Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging