Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes
Xinhao Xiang, Kuan-Chuan Peng, Suhas Lohit, Michael J. Jones, Jiawei Zhang

TL;DR
This paper introduces OVODA, a framework for open-vocabulary 3D object detection with attribute recognition, leveraging foundation models and a new dataset, OVAD, to improve recognition of novel objects and their attributes in autonomous systems.
Contribution
The paper presents a novel open-vocabulary 3D detection framework that jointly detects objects and their attributes without requiring known anchor sizes for novel classes, supported by a new attribute-annotated dataset.
Findings
Outperforms state-of-the-art methods in open-vocabulary 3D object detection on nuScenes and Argoverse 2
Recognizes object attributes without prior knowledge of novel-class anchor sizes
Demonstrates the effectiveness of foundation model integration and prompt tuning techniques
Abstract
3D object detection plays a crucial role in autonomous systems, yet existing methods are limited by closed-set assumptions and struggle to recognize novel objects and their attributes in real-world scenarios. We propose OVODA, a novel framework enabling both open-vocabulary 3D object and attribute detection with no need to know the novel class anchor size. OVODA uses foundation models to bridge the semantic gap between 3D features and texts while jointly detecting attributes such as spatial relationships and motion states. To facilitate this research direction, we propose OVAD, a new dataset that supplements existing 3D object detection benchmarks with comprehensive attribute annotations. OVODA incorporates several key innovations, including foundation model feature concatenation, prompt tuning strategies, and specialized techniques for attribute detection, including…
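The abstract describes matching detected-object features against text using foundation models and feature concatenation. The following is only an illustrative sketch of that general idea, not the authors' implementation: it concatenates normalized features from two hypothetical encoders and picks the class whose text embedding has the highest cosine similarity. All function names, vectors, and labels are assumptions made for the example, and the toy text embeddings are assumed to already live in the concatenated feature space (a real system would typically learn a projection to align them).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concat_features(*feature_vectors):
    """Normalize each feature vector, then concatenate them (hypothetical helper)."""
    combined = []
    for features in feature_vectors:
        combined.extend(l2_normalize(features))
    return combined


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify_open_vocab(region_feature, text_embeddings):
    """Score a region feature against every text embedding and return the best label."""
    scores = {label: cosine(region_feature, emb) for label, emb in text_embeddings.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy features from two hypothetical foundation-model encoders (assumed values).
feat_a = [1.0, 0.0]
feat_b = [0.0, 1.0]
region_feature = concat_features(feat_a, feat_b)

# Toy text embeddings for an open vocabulary (assumed values, same space as above).
text_embeddings = {
    "car": [1.0, 0.0, 0.0, 1.0],
    "stroller": [0.0, 1.0, 1.0, 0.0],
}

best_label, scores = classify_open_vocab(region_feature, text_embeddings)
print(best_label, scores)
```

Because the label set is just a dictionary of text embeddings, new classes or attribute phrases could be added at inference time without retraining the scoring step, which is the property the open-vocabulary setting relies on.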
