Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

Daniel DeTone, Tianwei Shen, Fan Zhang, Lingni Ma, Julian Straub, Richard Newcombe, and Jakob Engel

arXiv:2604.05212 · cs.CV · April 8, 2026

TL;DR

Boxer is a transformer-based method that lifts 2D object detections into 3D bounding boxes using multi-view fusion, depth information, and uncertainty modeling, reducing the need for extensive 3D annotations.

Contribution

The paper introduces Boxer, a novel approach combining 2D detection, transformer-based lifting, and multi-view fusion for open-world 3D object localization with minimal 3D training data.

Findings

1. Outperforms the state of the art in open-world 3D bounding box lifting.
2. Achieves 0.532 mAP without dense depth, surpassing CuTR.
3. Uses large-scale training with over 1.2 million 3D boxes.

Abstract

Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made in solving 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images, and optional depth, represented either as a sparse point cloud or as dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent, de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g., DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on…
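
To make the de-duplication idea concrete, here is a minimal sketch of how per-view 3D boxes might be merged into one box per object. It is not the paper's fusion procedure: the abstract does not state the box parameterization or the fusion rule, so the axis-aligned boxes, the IoU threshold, the greedy clustering, and the names `Box3D`, `iou_3d`, and `fuse_boxes` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical axis-aligned box in metric world space (a simplification)."""
    lo: tuple     # (x, y, z) minimum corner
    hi: tuple     # (x, y, z) maximum corner
    score: float  # detection confidence, assumed positive


def volume(lo, hi):
    """Volume of the box spanned by lo and hi; zero if the extents are inverted."""
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(0.0, b - a)
    return v


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(x, y) for x, y in zip(a.lo, b.lo))
    hi = tuple(min(x, y) for x, y in zip(a.hi, b.hi))
    inter = volume(lo, hi)
    union = volume(a.lo, a.hi) + volume(b.lo, b.hi) - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(boxes, iou_thresh=0.5):
    """Greedily cluster boxes by overlap, then average each cluster.

    Boxes are visited in descending score order. Each box joins the first
    cluster whose highest-scoring member overlaps it by at least iou_thresh;
    otherwise it starts a new cluster. Corners are averaged with score weights.
    """
    clusters = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        for cluster in clusters:
            if iou_3d(cluster[0], box) >= iou_thresh:
                cluster.append(box)
                break
        else:
            clusters.append([box])

    fused = []
    for cluster in clusters:
        total = sum(b.score for b in cluster)
        lo = tuple(sum(b.lo[i] * b.score for b in cluster) / total for i in range(3))
        hi = tuple(sum(b.hi[i] * b.score for b in cluster) / total for i in range(3))
        fused.append(Box3D(lo, hi, max(b.score for b in cluster)))
    return fused


# Two views of the same object plus one unrelated object.
detections = [
    Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.9),
    Box3D((0.1, 0.0, 0.0), (1.1, 1.0, 1.0), 0.6),
    Box3D((5.0, 5.0, 5.0), (6.0, 6.0, 6.0), 0.8),
]
fused = fuse_boxes(detections)
```

In this toy input the first two detections overlap heavily and collapse into a single box, while the distant third box is kept on its own. A real system would likely use oriented boxes and additional geometric checks, which this sketch omits.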
