DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving
Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng

TL;DR
DriveCamSim introduces a generalizable camera simulation framework for autonomous driving that explicitly models camera parameters and maintains high visual quality, enabling flexible, controllable, and robust multi-view video generation.
Contribution
The paper proposes Explicit Camera Modeling (ECM) for decoupling camera configuration from the model, enhancing generalization and controllability in camera simulation for autonomous driving.
Findings
Superior visual quality and controllability demonstrated.
Effective generalization across camera parameters and frame rates.
Enhanced temporal consistency and identity-awareness.
Abstract
Camera sensor simulation plays a critical role in autonomous driving (AD), e.g., in evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework, DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training…
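To make the idea of an explicit pixel-wise correspondence concrete, the sketch below reprojects one pixel from a source camera into a target camera using a plain pinhole model. It is an illustration under stated assumptions, not the paper's ECM implementation: the function names, the single depth hypothesis per pixel, and the 4x4 camera-to-world matrix convention are all hypothetical choices made for this example.

```python
# Illustrative sketch only: pinhole reprojection of one pixel between two cameras.
# Function names, the depth hypothesis, and the matrix conventions are assumptions
# made for this example; they are not taken from the DriveCamSim paper.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def backproject(pixel, depth, K):
    """Lift pixel (u, v) at an assumed depth to a 3D point in the camera frame."""
    u, v = pixel
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def project(point_cam, K):
    """Project a camera-frame 3D point to pixel coordinates (None if behind camera)."""
    x, y, z = point_cam
    if z <= 0:
        return None
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return (fx * x / z + cx, fy * y / z + cy)

def rigid_inverse(T):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def correspond(pixel, depth, K_src, T_src_to_world, K_dst, T_dst_to_world):
    """Find where a source pixel lands in the target view for a given depth."""
    p_src = backproject(pixel, depth, K_src) + [1.0]        # homogeneous point
    p_world = matvec(T_src_to_world, p_src)                  # source camera -> world
    p_dst = matvec(rigid_inverse(T_dst_to_world), p_world)   # world -> target camera
    return project(p_dst[:3], K_dst)

# Example: two cameras with the same intrinsics, target shifted 0.5 m along x.
K = [[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]]
T_src = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_dst = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
match = correspond((800.0, 450.0), 10.0, K, T_src, K, T_dst)
```

Because the intrinsics and extrinsics enter only as function arguments, the same routine works for any camera rig or frame pair, which is the intuition behind decoupling a model from one fixed camera configuration.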
Peer Reviews
Decision · Submitted to ICLR 2026
1. The proposed Explicit Camera Modeling (ECM) is a strong technical contribution that directly addresses a major limitation in prior work. Decoupling the model from fixed camera configurations is a well-motivated and crucial step toward creating truly flexible and practical AD simulators. The qualitative results also look great. 2. The identity-aware embedding is inspiring: it maintains consistency for dynamic objects, which is a common failure point in generative models. 3. The qualitative and qua
1. The proposed Explicit Camera Modeling enables the model to generalize to unseen parameters. Although qualitative experiments show that the model outperforms previous methods in generalizing across camera configurations, it would be better to include quantitative evaluations with modern feed-forward SfM models, showing that the generated images follow the desired camera parameters. 2. The ablation study is an important part and should be included in the main paper. 3. The submission do
- A novel and compact explicit camera modeling mechanism is proposed. - Detailed visualization results are provided, offering valuable insights.
- In Table 3, the perspective-based and attention-based control mechanisms are presented, but it is unclear which methods these mechanisms correspond to. - The novelty of the approach is not immediately apparent in the methods section, as it contains a lot of detailed explanations about handling different conditions.
1. The proposed ECM mechanism effectively decouples the model from specific camera parameters and temporal sampling rates by establishing explicit pixel-wise correspondences in 3D space, filling the gap of poor generalization in existing implicit modeling methods. 2. The information-preserving control mechanism, especially the identity-aware extension, successfully mitigates information loss in conditional encoding and injection, improving both controllability and foreground temporal consistency
1. Using MagicDrive and DreamForge as baselines is insufficient, as they do not support camera parameter generalization. Instead, the paper should conduct a direct comparison with 3D-based generative works such as MagicDrive3D [a] to show its advantages. 2. No video-specific evaluation metrics are employed. Benchmarks such as W-CODA2024 [b], which are tailored for video generation quality and consistency, should be adopted to comprehensively assess temporal performance. 3. Key components in Figure 4
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Transportation and Mobility Innovations
