TL;DR
Boxer is a transformer-based method that lifts 2D object detections into 3D bounding boxes using posed images, optional sparse or dense depth, multi-view fusion, and uncertainty modeling, reducing the need for extensive 3D annotations.
Contribution
The paper introduces Boxer, an approach that combines existing 2D open-vocabulary detectors, transformer-based 2D-to-3D lifting (BoxerNet), and multi-view fusion for open-world 3D object localization, with reduced reliance on annotated 3D training data.
Findings
Boxer outperforms state-of-the-art baselines in open-world 3D bounding box lifting.
Without dense depth, it achieves 0.532 mAP, surpassing CuTR.
Training is large-scale, using over 1.2 million 3D boxes.
Abstract
Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on…
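The abstract mentions that per-view 3D boxes are fused and filtered into de-duplicated 3DBBs, but this page does not say how. As a rough illustration only, the sketch below shows one common way to de-duplicate overlapping 3D boxes: greedy suppression by 3D IoU. It is not Boxer's actual fusion step. The axis-aligned box format, the function names `aabb_iou_3d` and `deduplicate_boxes`, and the `iou_threshold` value are all assumptions made for this example.

```python
def aabb_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax) in world coordinates.
    This format is an assumption for illustration, not the paper's box
    parameterization.
    """
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        inter *= max(0.0, hi - lo)  # zero if the boxes do not overlap on this axis
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate_boxes(boxes, scores, iou_threshold=0.25):
    """Greedy de-duplication: visit boxes from highest to lowest score and
    keep a box only if it overlaps every already-kept box by less than
    iou_threshold. Returns the indices of the kept boxes.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(aabb_iou_3d(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return kept
```

Given two nearly coincident boxes from different views and a third box far away, `deduplicate_boxes` keeps the higher-scoring of the overlapping pair plus the distant box. A real multi-view system would more likely merge the duplicates (for example by averaging their parameters) and handle oriented boxes, which this sketch does not attempt.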
