From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Tessa Pulli; Stefan Thalhammer; Simon Schwaiger; Markus Vincze

arXiv:2409.05413·cs.CV·September 10, 2024

Open Access

TL;DR

This paper introduces a zero-shot 6D object pose estimation framework built on vision language models, enabling robots to detect novel objects and estimate their poses without object-specific training.

Contribution

It proposes a promptable zero-shot pose estimation method using language embeddings and relevancy maps, advancing open-set object detection in robotics.

Findings

01 Effective coarse localization of objects using language-embedded NeRF

02 Demonstrated zero-shot pose estimation at instance and category levels

03 Analyzed hyperparameters influencing relevancy map accuracy

Abstract

Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object…
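The coarse-localization idea in the abstract can be illustrated with a minimal sketch. It assumes the relevancy map has already been lifted to per-point scores in [0, 1] on a reconstructed point cloud; the function name `coarse_localize`, the `CoarseRegion` container, and the threshold of 0.5 are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class CoarseRegion:
    centroid: Point      # relevancy-weighted centre of the selected points
    bbox_min: Point      # axis-aligned bounding box, lower corner
    bbox_max: Point      # axis-aligned bounding box, upper corner
    indices: List[int]   # indices of the points that passed the threshold


def coarse_localize(points: Sequence[Point],
                    relevancy: Sequence[float],
                    threshold: float = 0.5) -> CoarseRegion:
    """Select points whose relevancy score reaches the threshold.

    Hypothetical helper: the threshold value is an assumption, not a
    setting reported by the authors.
    """
    if len(points) != len(relevancy):
        raise ValueError("points and relevancy must have the same length")

    keep = [i for i, score in enumerate(relevancy) if score >= threshold]
    if not keep:
        raise ValueError("no point reaches the relevancy threshold")

    total = sum(relevancy[i] for i in keep)
    if total > 0:
        # Weight each selected point by its relevancy score.
        centroid = tuple(
            sum(relevancy[i] * points[i][axis] for i in keep) / total
            for axis in range(3)
        )
    else:
        # All selected scores are zero: fall back to the plain mean.
        centroid = tuple(
            sum(points[i][axis] for i in keep) / len(keep)
            for axis in range(3)
        )

    bbox_min = tuple(min(points[i][axis] for i in keep) for axis in range(3))
    bbox_max = tuple(max(points[i][axis] for i in keep) for axis in range(3))
    return CoarseRegion(centroid, bbox_min, bbox_max, keep)


# Toy scene: two points respond strongly to the (hypothetical) text prompt.
scene_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
scene_relevancy = [0.1, 0.8, 0.8, 0.2]
region = coarse_localize(scene_points, scene_relevancy)
print(region.indices, region.centroid)
```

In a pipeline of the kind the abstract outlines, the selected region would then be handed to a point cloud registration method to compute the 6D pose; that step is not sketched here.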

Taxonomy

Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Advanced Image and Video Retrieval Techniques