Cubify Anything: Scaling Indoor 3D Object Detection

Justin Lazarow; David Griffiths; Gefen Kohavi; Francisco Crespo; Afshin Dehghan

arXiv:2412.04458·cs.CV·December 6, 2024

TL;DR

This paper introduces a large-scale 3D object detection dataset and a novel Transformer-based detection method that outperforms existing point-based approaches, especially in noisy, real-world indoor scenes.

Contribution

The paper presents the CA-1M dataset, with exhaustive 3D annotations, and CuTR, a Transformer model that predicts 3D boxes directly from 2D features without the 3D inductive biases of point- or voxel-based detectors.

Findings

1. CuTR outperforms point-based methods in 3D recall (62%).
2. CA-1M dataset enables better generalization and robustness.
3. Pre-training on CA-1M improves performance on diverse datasets.
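
To make the "3D recall" figure in the findings more concrete, here is a minimal sketch of how recall over 3D boxes can be computed. It is not the paper's evaluation protocol: the axis-aligned box format, the function names `aabb_iou_3d` and `recall_at_iou`, and the 0.25 IoU threshold are all assumptions chosen for illustration, and the matching is deliberately simplified (a ground-truth box counts as recalled if any prediction overlaps it enough, with no one-to-one assignment).

```python
import numpy as np


def aabb_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax). This format is an
    assumption for the sketch; oriented boxes would need a different overlap test.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])          # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])          # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0


def recall_at_iou(gt_boxes, pred_boxes, threshold=0.25):
    """Fraction of ground-truth boxes overlapped by at least one prediction.

    The threshold value is hypothetical; no one-to-one matching is enforced.
    """
    if len(gt_boxes) == 0:
        return 0.0
    hits = sum(
        any(aabb_iou_3d(g, p) >= threshold for p in pred_boxes)
        for g in gt_boxes
    )
    return hits / len(gt_boxes)


if __name__ == "__main__":
    gt = [(0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6)]
    pred = [(0, 0, 0, 1, 1, 2)]
    print(recall_at_iou(gt, pred))
```

In this toy case one of the two ground-truth boxes is covered by a prediction, so half of the objects are recalled. A stricter evaluation would typically assign each prediction to at most one ground-truth box.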

Abstract

We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which, rather than operating in 3D on point- or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR…
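
As an illustration of what "predicting a 3D box" can mean as an output, the sketch below expands one common box parameterization (center, size, and a yaw angle about the vertical axis) into its eight corners. This parameterization and the function name `box_corners` are assumptions for illustration only; they are not taken from the paper, and CuTR's actual box encoding may differ.

```python
import numpy as np


def box_corners(center, size, yaw):
    """Return the 8 corners (shape (8, 3)) of a yaw-rotated 3D box.

    center: (x, y, z) of the box centre.
    size:   full extents along the box's local x, y, z axes.
    yaw:    rotation in radians about the z axis (assumed to be "up").
    """
    # All sign combinations (+/-1) select the 8 corners in the local frame.
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    local = signs * (np.asarray(size, dtype=float) / 2.0)

    # Rotation about z by the yaw angle.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    # Rotate the local corners, then translate to the box centre.
    return local @ rot.T + np.asarray(center, dtype=float)


if __name__ == "__main__":
    corners = box_corners(center=(0.0, 0.0, 0.0), size=(4.0, 2.0, 2.0), yaw=np.pi / 2)
    print(corners.round(3))
```

With a quarter-turn yaw, the box's long side swings from the x axis to the y axis, which is a quick way to sanity-check the rotation convention.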

Code & Models

Repositories

apple/ml-cubifyanything
pytorch

Taxonomy

Topics: Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Softmax · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Dense Connections