Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes

Xinhao Xiang; Kuan-Chuan Peng; Suhas Lohit; Michael J. Jones; Jiawei Zhang

arXiv:2508.16812 · cs.CV · August 26, 2025


TL;DR

This paper introduces OVODA, a framework for open-vocabulary 3D object detection with attribute recognition, leveraging foundation models and a new dataset, OVAD, to improve recognition of novel objects and their attributes in autonomous systems.

Contribution

The paper presents an open-vocabulary 3D detection framework that jointly detects objects and their attributes without needing to know the anchor sizes of novel classes, supported by a new attribute-annotated dataset.

Findings

01 Outperforms state-of-the-art methods in open-vocabulary 3D object detection on nuScenes and Argoverse 2.

02 Recognizes object attributes without prior knowledge of class anchor sizes.

03 Demonstrates the effectiveness of foundation model integration and prompt tuning techniques.

Abstract

3D object detection plays a crucial role in autonomous systems, yet existing methods are limited by closed-set assumptions and struggle to recognize novel objects and their attributes in real-world scenarios. We propose OVODA, a novel framework enabling both open-vocabulary 3D object and attribute detection with no need to know the anchor sizes of novel classes. OVODA uses foundation models to bridge the semantic gap between 3D features and texts while jointly detecting attributes such as spatial relationships and motion states. To facilitate this research direction, we propose OVAD, a new dataset that supplements existing 3D object detection benchmarks with comprehensive attribute annotations. OVODA incorporates several key innovations, including foundation model feature concatenation, prompt tuning strategies, and specialized techniques for attribute detection, including…
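The abstract does not spell out how features and texts are matched, so the following is only a minimal sketch of the general idea, not OVODA's actual implementation. It assumes that two per-box feature vectors (standing in for features from two foundation models) are concatenated and then compared by cosine similarity against text embeddings of class and attribute names. All function names, vector values, and the assumption that text embeddings already share the fused feature's dimension are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_scores(query, text_embeddings):
    """Cosine similarity between one query vector and each labeled text embedding."""
    q = l2_normalize(query)
    scores = {}
    for label, emb in text_embeddings.items():
        t = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(q, t))
    return scores


def classify_box(feat_a, feat_b, class_text, attribute_text):
    """Pick the best-matching class and attribute for one detected box.

    feat_a and feat_b are hypothetical per-box features from two different
    foundation models; concatenation is the fusion step assumed here.
    """
    fused = list(feat_a) + list(feat_b)
    class_scores = cosine_scores(fused, class_text)
    attr_scores = cosine_scores(fused, attribute_text)
    best_class = max(class_scores, key=class_scores.get)
    best_attr = max(attr_scores, key=attr_scores.get)
    return best_class, best_attr


# Toy values chosen by hand for illustration; they are not real embeddings.
class_text = {"car": [1.0, 0.0, 0.0, 1.0], "pedestrian": [0.0, 1.0, 1.0, 0.0]}
attribute_text = {"moving": [0.9, 0.1, 0.0, 1.0], "parked": [0.0, 1.0, 1.0, 0.0]}

result = classify_box([1.0, 0.0], [0.0, 1.0], class_text, attribute_text)
print(result)
```

Because the label sets are plain dictionaries of text embeddings, a new class or attribute can be added at inference time by adding an entry, which is the property that makes this style of matching open-vocabulary. A real system would typically learn a projection that aligns the fused 3D features with the text embedding space rather than assume the dimensions already match.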
