BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

Yuqing Lan; Chenyang Zhu; Zhirui Gao; Jiazhao Zhang; Yihan Cao; Renjiao Yi; Yijie Wang; Kai Xu

arXiv:2506.15610 · cs.CV · August 26, 2025

Open Access

TL;DR

BoxFusion introduces a real-time, reconstruction-free 3D object detection framework that fuses multi-view bounding boxes using an efficient association and optimization process, enabling scalable and fast open-vocabulary detection in large environments.

Contribution

The paper presents a novel online 3D detection method that avoids dense point cloud reconstruction. Instead, it fuses bounding boxes detected in multiple views through an association and optimization module, which improves efficiency.
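To make the idea of multi-view box fusion concrete, here is a minimal sketch of one way streaming boxes could be associated and merged. It is not the authors' association or optimization module: it assumes axis-aligned boxes in a shared world frame, uses greedy 3D IoU matching, and fuses by a running mean. The function names, the box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, and the `iou_threshold` value are all hypothetical choices made for illustration.

```python
import numpy as np


def aabb_iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3], b[:3])          # lower corner of the overlap
    hi = np.minimum(a[3:], b[3:])          # upper corner of the overlap
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def fuse_stream(detections, iou_threshold=0.25):
    """Greedy online fusion (illustrative assumption, not the paper's method).

    Each incoming box joins the existing track it overlaps most, provided the
    IoU reaches the threshold; otherwise it starts a new track. A track keeps
    the running mean of the boxes assigned to it.
    """
    tracks = []
    for box in detections:
        box = np.asarray(box, float)
        ious = [aabb_iou_3d(t["box"], box) for t in tracks]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_threshold:
            t = tracks[best]
            t["count"] += 1
            t["box"] += (box - t["box"]) / t["count"]  # incremental mean
        else:
            tracks.append({"box": box.copy(), "count": 1})
    return tracks
```

Only the fused boxes are stored, so memory grows with the number of objects rather than with the number of frames or reconstructed points. That is the general appeal of a reconstruction-free design, although the real system would also have to handle oriented boxes and partial observations.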

Findings

01. Achieves state-of-the-art performance on ScanNetV2 and CA-1M datasets.

02. Demonstrates real-time perception in environments over 1000 square meters.

03. Exhibits strong generalization across various scenarios.

Abstract

Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of…
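The open-vocabulary step mentioned in the abstract can be pictured as matching an image embedding for each detected box against text embeddings of candidate class names. The sketch below shows only that matching step with plain NumPy arrays. It assumes the embeddings were already produced by some vision-language model such as CLIP; the function name `assign_labels` and the toy vectors are hypothetical, and how BoxFusion actually extracts and aggregates these features is not specified in the text above.

```python
import numpy as np


def assign_labels(box_features, text_features, class_names):
    """Pick, for each box embedding, the class whose text embedding is closest.

    box_features:  (num_boxes, dim) array of per-box image embeddings (assumed given)
    text_features: (num_classes, dim) array of class-name text embeddings (assumed given)
    Returns the chosen names and the full cosine-similarity matrix.
    """
    f = np.asarray(box_features, float)
    t = np.asarray(text_features, float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)  # unit-normalize rows
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = f @ t.T                                     # cosine similarities
    idx = sim.argmax(axis=1)
    return [class_names[i] for i in idx], sim


# Toy 2-D embeddings standing in for real model outputs.
names, sim = assign_labels(
    box_features=[[1.0, 0.1], [0.1, 1.0]],
    text_features=[[1.0, 0.0], [0.0, 1.0]],
    class_names=["chair", "table"],
)
```

Because the class list is just a set of text embeddings, it can be swapped at query time without retraining the detector, which is what makes this style of labeling open-vocabulary.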

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition