Learning 3D Perception from Others' Predictions
Jinsu Yoo, Zhenyang Feng, Tai-Yu Pan, Yihong Sun, Cheng Perng Phoo, Xiangyu Chen, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao

TL;DR
This paper introduces a label-efficient method for 3D object detection that learns from nearby units' predictions, addressing issues like viewpoint mismatch and mislocalization, and improving detection accuracy with minimal annotated data.
Contribution
The paper proposes a distance-based curriculum and pseudo label refinement to effectively learn 3D perception from other units' predictions, reducing the need for extensive annotations.
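The summary names a distance-based curriculum but does not describe how it works. The sketch below is one plausible reading, not the authors' implementation: training starts with frames where the reference unit is close to the ego car, on the assumption that viewpoint mismatch is smallest there, and then admits farther frames stage by stage. The `Frame` type, the `ref_distance_m` field, the `curriculum_stages` helper, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    """One training frame with pseudo labels received from a reference unit (hypothetical type)."""
    frame_id: str
    ref_distance_m: float  # assumed: ego-to-reference-unit distance in meters


def curriculum_stages(
    frames: Sequence[Frame], thresholds_m: Sequence[float]
) -> List[List[Frame]]:
    """Return one training subset per stage.

    Each stage keeps every frame whose reference distance is at most that
    stage's threshold, so later stages are supersets of earlier ones.
    """
    stages = []
    for limit in sorted(thresholds_m):
        stages.append([f for f in frames if f.ref_distance_m <= limit])
    return stages


# Illustrative data only; distances and thresholds are made up.
frames = [Frame("a", 5.0), Frame("b", 22.0), Frame("c", 48.0)]
stages = curriculum_stages(frames, thresholds_m=[10.0, 30.0, 60.0])
stage_ids = [[f.frame_id for f in stage] for stage in stages]
```

In a training loop, each stage's subset would be used for some number of epochs before moving to the next; how the stages interact with pseudo label refinement is not specified in this summary.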
Findings
Effective pseudo label refinement reduces data requirements
Distance-based curriculum improves learning from multiple viewpoints
Method outperforms baseline approaches in real-world scenarios
Abstract
Accurate 3D object detection in real-world environments requires a huge amount of annotated data with high quality. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground-truths to train the detector for the ego car, however, leads to inferior performance. We systematically…
Peer Reviews
Decision: ICLR 2025 Poster
- The quality of the paper's figures is high and relatively clear and explicit.
- This method ensures the accuracy of the detector while reducing the cost of annotation, and its effectiveness has been verified on both real and simulated datasets.
1. This paper introduces a variant of the offboard 3D object detection method. It only selects an unsupervised scheme as a baseline, which is inappropriate. For a fair comparison, the proposed paper should be compared with offboard 3D object detection methods with similar experimental settings. These methods also rely on an accurate detector to label unannotated scenes. The authors should select the correct baselines[1] [2] for comparison. [1] DetZero: Rethinking Offboard 3D Object Detection wi
+ The paper introduces an interesting problem formulation for 3D object detection in resource-limited settings, suggesting a new collaborative information-sharing approach to reduce labeling costs.
+ The ablation study is well designed. Results demonstrate the contribution of different design components to the overall performance, sharing valuable insights into the effectiveness of these different components in the proposed pipeline.
- Practicality and applicability of the problem setting. One of my main concerns is the practicality of the proposed problem setting. First, the approach relies heavily on the assumption that neighboring cars with accurate detectors are always available when needed, but this may not always be feasible. Additionally, without having a good pre-trained detector but attempting to rely on other cars’ detection sharing seems unsafe. This mechanism will also need to resolve multiple other dependencies
1. The paper is well-structured, with a clear abstract, introduction, methodology, experiments, and conclusion sections that logically flow from one to the next.
2. The proposed Refining & Discovering Boxes for 3D Perception from Others’ Predictions (R&B-POP) method is simple and effective.
3. The paper has extensive experiments and the figures are very professionally made.
1. The scenario assumed in the paper has great limitations, that is, learning knowledge from nearby agents to achieve detection. Theoretically, the learning process needs to be fast enough so that the current vehicles can predict new objects. However, I did not see the authors' analysis of the algorithm's training speed and discussion of actual deployment limitations. 2. Box ranker seems to be just an IoU scoring method, which is adopted in most detectors, such as Voxel-RCNN, PV-RCNN, they are n
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
Methods: Greedy Policy Search
