OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation

Junhao Cai; Yisheng He; Weihao Yuan; Siyu Zhu; Zilong Dong; Liefeng Bo; Qifeng Chen

arXiv:2403.12396·cs.CV·March 20, 2024·1 cites

Open Access

TL;DR

This paper introduces OV9D, a method for open-vocabulary category-level 9D object pose and size estimation, leveraging a new large-scale dataset and pre-trained visual-language models to generalize to unseen categories.

Contribution

The paper presents a novel framework combining a large-scale dataset and pre-trained models for open-vocabulary pose and size estimation at the category level.

Findings

01 Significant performance improvement over baselines.

02 Effective generalization to real-world unseen categories.

03 Large-scale dataset enhances training and evaluation.

Abstract

This paper studies a new open-set problem: open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetric ambiguity. Apart from the large-scale dataset, we find that another key to enabling such generalizability is leveraging the strong prior knowledge in pre-trained visual-language foundation models. We then propose a framework built on pre-trained…
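
The 9D state the abstract describes (position, orientation, and size) and the role of a per-category symmetry axis can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the names `Pose9D`, `rot_y`, `box_corners`, and `symmetry_axis_error_deg` are hypothetical, the symmetry axis is assumed to be the object-frame y axis, and the error measure (the angle between the predicted and ground-truth symmetry axes) is a common convention for symmetric objects rather than a metric confirmed by the text above.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class Pose9D:
    """Hypothetical container: 3 DoF rotation + 3 DoF translation + 3 DoF size."""
    rotation: Mat3     # orientation, stored as a 3x3 rotation matrix (row-major)
    translation: Vec3  # position of the object center in the camera frame
    size: Vec3         # box extents along the object's x, y, z axes


def rot_y(angle_rad: float) -> Mat3:
    """Rotation matrix for a spin of angle_rad about the y axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))


def _mat_vec(m: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def box_corners(pose: Pose9D) -> List[Vec3]:
    """Return the 8 corners of the oriented 3D box implied by the 9D pose."""
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                local = (sx * pose.size[0], sy * pose.size[1], sz * pose.size[2])
                rotated = _mat_vec(pose.rotation, local)
                corners.append(
                    tuple(r + t for r, t in zip(rotated, pose.translation))
                )
    return corners


def symmetry_axis_error_deg(
    pred: Pose9D, gt: Pose9D, axis: Vec3 = (0.0, 1.0, 0.0)
) -> float:
    """Angle in degrees between the predicted and ground-truth symmetry axes.

    Assumption: the object is rotationally symmetric about `axis` (object
    frame), so any spin about that axis should not count as an error.
    """
    a = _mat_vec(pred.rotation, axis)
    b = _mat_vec(gt.rotation, axis)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding
    return math.degrees(math.acos(dot))


if __name__ == "__main__":
    gt = Pose9D(rot_y(0.0), (0.0, 0.0, 1.0), (0.1, 0.2, 0.1))
    pred = Pose9D(rot_y(1.0), (0.0, 0.0, 1.0), (0.1, 0.2, 0.1))
    # A spin about the assumed symmetry axis is not penalized.
    print(symmetry_axis_error_deg(pred, gt))
    # The same spin is penalized if a non-symmetry axis is compared instead.
    print(symmetry_axis_error_deg(pred, gt, axis=(1.0, 0.0, 0.0)))
```

Under these assumptions, a prediction that differs from the ground truth only by a spin about the symmetry axis incurs zero rotation error, which is the kind of ambiguity the symmetry-axis annotations are meant to resolve.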

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Diffusion