UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Honghui Yang; Sha Zhang; Di Huang; Xiaoyang Wu; Haoyi Zhu; Tong He; Shixiang Tang; Hengshuang Zhao; Qibo Qiu; Binbin Lin; Xiaofei He; Wanli Ouyang

arXiv:2310.08370·cs.CV·April 9, 2024·2 cites

TL;DR

UniPAD introduces a versatile self-supervised pre-training paradigm for autonomous driving that leverages 3D volumetric differentiable rendering to improve scene understanding across various 3D tasks.

Contribution

It presents a novel 3D pre-training method using differentiable rendering, enhancing feature learning for autonomous driving beyond traditional 2D-based approaches.

Findings

1. Significantly improves baseline performance on lidar, camera, and combined data.
2. Achieves state-of-the-art results on nuScenes 3D object detection and segmentation.
3. Demonstrates effective integration into both 2D and 3D frameworks.

Abstract

In the context of autonomous driving, the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success, most methods follow the ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method enables seamless integration into both 2D and 3D frameworks, enabling a more holistic comprehension of the scenes. We manifest the feasibility and effectiveness of UniPAD by conducting extensive experiments on various downstream 3D tasks. Our method significantly improves lidar-, camera-, and…
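The abstract does not spell out the rendering formulation, so the following is only a minimal sketch of generic volumetric rendering along a single ray, using the standard NeRF-style quadrature rather than UniPAD's own method. The function name `render_ray` and its arguments are hypothetical. It shows why such a renderer is differentiable: the rendered color and depth are weighted sums of per-sample values, so a loss on the 2D projection can propagate back to the 3D features that produced the densities.

```python
import math


def render_ray(densities, colors, depths):
    """Composite samples along one ray (generic NeRF-style quadrature).

    densities: non-negative density per sample (assumed to come from a 3D feature volume)
    colors:    RGB triple per sample
    depths:    increasing distance of each sample from the ray origin
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rendered_color = [0.0, 0.0, 0.0]
    rendered_depth = 0.0
    weights = []
    for i, (sigma, color, t) in enumerate(zip(densities, colors, depths)):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < len(depths):
            delta = depths[i + 1] - t
        else:
            delta = t - depths[i - 1] if i > 0 else 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        rendered_color = [acc + weight * c for acc, c in zip(rendered_color, color)]
        rendered_depth += weight * t
        transmittance *= 1.0 - alpha
    return rendered_color, rendered_depth, weights
```

In a pre-training setting of this kind, the rendered color would be compared against image pixels and the rendered depth against lidar-derived depth; which losses UniPAD actually uses is not stated on this page.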

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- A self-supervised learning approach that is capable of handling both point clouds and multi-view images.
- A discussion on the ray sampling for reducing memory consumption.
- Benchmarking results on two 3D perception tasks under three sensor setups. Consistent improvements over baselines.
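The ray-sampling point above can be illustrated with a small sketch. Rendering every pixel of every camera view is memory-hungry, so a common remedy is to render only a subset of rays per training step. The review does not say which sampling strategy the paper uses; the uniform random sampling below, along with the name `sample_ray_pixels`, is an assumption made for illustration.

```python
import random


def sample_ray_pixels(height, width, num_rays, seed=None):
    """Pick distinct pixel coordinates to render instead of the full image.

    Uniform sampling is an illustrative assumption, not the paper's strategy.
    """
    rng = random.Random(seed)
    total = height * width
    k = min(num_rays, total)  # cannot sample more rays than pixels
    flat_indices = rng.sample(range(total), k)
    # Convert flat indices back to (row, column) pairs.
    return [(idx // width, idx % width) for idx in flat_indices]
```

Memory for the rendering step then scales with `num_rays` rather than with `height * width`.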

Weaknesses

- Limited novelty of UniPAD. While many engineering efforts were made, most of the foundation of the proposed UniPAD approach stems from [R1].
- Relatively outdated baselines. The chosen baselines (UVTR, FCOS3D, and SpUNet) for the three sensor setups are from previous literature; the effectiveness of UniPAD on top of stronger baselines remains unknown.
- Missing comparisons with the current state of the art. The benchmarked approaches do not include state-of-the-art methods.

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. The overall idea of masking the input, as well as having some sort of a "rendering loss" in 2D, sounds reasonable.
2. Pretraining with UniPAD does seem to show a clear improvement in NDS and mAP numbers, as shown in table 1.
3. The existing experiments and ablations (tables 3 through 8) were good, e.g. it was interesting to see decoder depth matters more than width.
4. The paper was overall easy to read.
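To make the masking idea in item 1 concrete, here is a minimal sketch of choosing which input patches to hide for a given masking ratio. The helper name `mask_patches` and the patch-level granularity are assumptions for illustration; the reviews do not describe how UniPAD partitions or masks its inputs.

```python
import random


def mask_patches(num_patches, mask_ratio, seed=None):
    """Return one boolean per patch; True means the patch is hidden from the encoder."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]
```

The model would then be asked to reconstruct the scene from the visible patches only, with the rendering loss supervising the result.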

Weaknesses

**Experiments:**
1. There are no qualitative results or visualizations.
2. Related to the above, it would be great if the authors shared insights from qualitative evaluations: does the model behave differently, i.e. are the error modes different as a result of pre-training with UniPAD (e.g. fewer mis-predicted blobs, less flickering, etc.)?
3. It seems to me that there are 2 components to UniPAD: Masking (i.e. asking the model to fill in information) and rendering (i.e. a loss that is aware of project

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The overall approach of using rendering to minimize the discrepancy between the rendered projection and the input in the self-driving setting is novel.

Weaknesses

A more grounded discussion of which parts of the scenes are actually better represented would help. Does it work better on parts of the scenes that are much more intricate, or is it a general overall improvement in accuracy?

Reviewer 04 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

- This is the first paper that leverages volumetric differentiable rendering to address the perception pre-training problem.
- Their method unifies the multi-view image representation and the LiDAR point cloud representation into a volumetric space.
- Their method outperforms the existing baseline approaches on the benchmarks, which demonstrates the effectiveness of the proposed method.
- The paper writing is clear and easy to follow.

Weaknesses

- The paper title should be **A Universal Pre-training Paradigm for 3D Perception**, instead of **A Universal Pre-training Paradigm for Autonomous Driving**, as this paper mainly focuses on pre-training for perception tasks rather than all driving tasks including prediction and planning.
- Table 8 (a): the detection performance wrt. masking ratio didn't change much when the masking ratio ranged from 0.1 to 0.7. A small masking ratio leads to little information loss, but this pre-training strate

Code & Models

Repositories

Nightmare-n/UniPAD · PyTorch · Official

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Robotics and Sensor-Based Localization