Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Jiahao Wang; Bo Sun; Yijing Bai; Vincent Casser; Songyou Peng; Zehao Zhu; Meng-Li Shih; Xander Masotto; Shih-Yang Su; Kanaad V Parvate; Tiancheng Ge; Linn Bieske; Dragomir Anguelov; Mingxing Tan; Chiyu Max Jiang

arXiv:2605.22809 · cs.CV · May 22, 2026


TL;DR

Sensor2Sensor introduces a generative model that converts in-the-wild dashcam videos into multi-modal sensor data, enhancing autonomous vehicle training datasets with diverse, realistic scenarios.

Contribution

The paper presents a novel diffusion-based approach that translates monocular videos into structured multi-modal sensor data without requiring paired training data.

Findings

01

High-fidelity sensor data generated from in-the-wild videos.

02

Quantitative evaluations show realistic and diverse sensor reconstructions.

03

Practical utility demonstrated by converting internet and dashcam footage.

Abstract

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, in the diversity of sensor configurations, and in geographic and long-tail behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS, which expect structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of…
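To make the input/output contrast described in the abstract concrete, here is a minimal Python sketch of the two data layouts: a single monocular dashcam clip on one side, and an AV log with several camera views plus LiDAR on the other. Everything in it is a hypothetical illustration, not the paper's code: the names `CameraSpec`, `SensorRig`, `dashcam_layout`, and `av_log_layout`, the camera names and image sizes, the (x, y, z) point format, and the assumption of one output timestep per input frame are all choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical sketch: names, shapes, and rig layout are illustrative
# assumptions, not the sensor configuration used by Sensor2Sensor.

Shape = Tuple[Optional[int], ...]


@dataclass(frozen=True)
class CameraSpec:
    """One camera in a hypothetical multi-view rig."""
    name: str
    height: int
    width: int


@dataclass(frozen=True)
class SensorRig:
    """A hypothetical AV sensor suite: several cameras plus one LiDAR."""
    cameras: Tuple[CameraSpec, ...]
    lidar_point_features: int = 3  # assumed (x, y, z); real logs may add more


def dashcam_layout(num_frames: int, height: int, width: int) -> Dict[str, Shape]:
    """Input side: a single monocular RGB video."""
    return {"video": (num_frames, height, width, 3)}


def av_log_layout(num_frames: int, rig: SensorRig) -> Dict[str, Shape]:
    """Output side: per-camera RGB frames plus one LiDAR point cloud per frame.

    Assumes, for illustration only, one output timestep per input frame.
    """
    layout: Dict[str, Shape] = {
        f"camera/{cam.name}": (num_frames, cam.height, cam.width, 3)
        for cam in rig.cameras
    }
    # The number of LiDAR returns varies from sweep to sweep, so it is left as None.
    layout["lidar"] = (num_frames, None, rig.lidar_point_features)
    return layout


if __name__ == "__main__":
    rig = SensorRig(cameras=(
        CameraSpec("front", 320, 480),
        CameraSpec("front_left", 320, 480),
        CameraSpec("front_right", 320, 480),
    ))
    print(dashcam_layout(16, 720, 1280))
    print(av_log_layout(16, rig))
```

The sketch highlights the asymmetry the abstract points to: one stream of RGB frames goes in, while several synchronized camera streams plus 3D LiDAR geometry come out. This suggests why the conversion is framed as a generative problem rather than a reformatting step.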
