Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization

Kento Kawaharazuka; Yoshiki Obinata; Naoaki Kanazawa; Kei Okada; Masayuki Inaba

arXiv:2410.22707·cs.RO·October 31, 2024

TL;DR

This paper introduces a robotic state recognition method that scores the current image against a set of language prompts using a pre-trained vision-language model and tunes a weight for each prompt by black-box optimization, improving accuracy without retraining any neural network.

Contribution

It proposes using Image-to-Text Retrieval with prompt weighting for robotic state recognition, eliminating the need for retraining or manual programming.

Findings

01. Achieves higher accuracy with prompt weighting.

02. Enables recognition of challenging states like transparent doors and water flow.

03. Requires only prompts and weights, simplifying resource management.

Abstract

State recognition of the environment and objects, such as the open/closed state of doors and the on/off of lights, is indispensable for robots that perform daily life support and security tasks. Until now, state recognition methods have been based on training neural networks from manual annotations, preparing special sensors for the recognition, or manually programming to extract features from point clouds or raw images. In contrast, we propose a robotic state recognition method using a pre-trained vision-language model, which is capable of Image-to-Text Retrieval (ITR) tasks. We prepare several kinds of language prompts in advance, calculate the similarity between these prompts and the current image by ITR, and perform state recognition. By applying the optimal weighting to each prompt using black-box optimization, state recognition can be performed with higher accuracy. Experiments…
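
The prompt-weighting idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the embeddings are synthetic random vectors standing in for the output of a vision-language model, the four prompts and their sign convention (positive for "open", negative for "closed") are hypothetical, the decision rule is a weighted similarity sum thresholded at zero, and plain random search stands in for the black-box optimizer, which the abstract does not name.

```python
import numpy as np


def cosine_similarities(image_vec, prompt_vecs):
    """Cosine similarity between one image embedding and each prompt embedding."""
    image_unit = image_vec / np.linalg.norm(image_vec)
    prompt_units = prompt_vecs / np.linalg.norm(prompt_vecs, axis=1, keepdims=True)
    return prompt_units @ image_unit


def accuracy(weights, sims, labels):
    """Fraction of images whose weighted score agrees with the label.

    sims:   (n_images, n_prompts) similarity matrix.
    labels: boolean array, True for the first state (e.g. "open").
    A positive weighted sum is read as the first state (assumed rule).
    """
    predictions = sims @ weights > 0.0
    return float(np.mean(predictions == labels))


def random_search(sims, labels, initial_weights, n_iter=300, seed=0):
    """Random search over prompt weights in [-1, 1].

    A stand-in black-box optimizer: it only evaluates accuracy and never
    uses gradients. The search starts from initial_weights, so the result
    is never worse than that baseline on the given data.
    """
    rng = np.random.default_rng(seed)
    best_w = np.asarray(initial_weights, dtype=float)
    best_acc = accuracy(best_w, sims, labels)
    for _ in range(n_iter):
        candidate = rng.uniform(-1.0, 1.0, size=best_w.shape)
        acc = accuracy(candidate, sims, labels)
        if acc > best_acc:
            best_w, best_acc = candidate, acc
    return best_w, best_acc


def make_toy_data(n_images=200, dim=16, seed=1):
    """Synthetic stand-in for VLM embeddings (hypothetical data)."""
    rng = np.random.default_rng(seed)
    prompt_vecs = rng.normal(size=(4, dim))  # 2 "open" + 2 "closed" prompts
    labels = rng.random(n_images) < 0.5      # True means "open"
    # Images lean toward prompt 0 when "open" and prompt 2 when "closed".
    anchors = np.where(labels[:, None], prompt_vecs[0], prompt_vecs[2])
    image_vecs = anchors + rng.normal(scale=2.0, size=(n_images, dim))
    sims = np.stack([cosine_similarities(v, prompt_vecs) for v in image_vecs])
    return sims, labels


sims, labels = make_toy_data()
uniform = np.array([1.0, 1.0, -1.0, -1.0])  # unweighted baseline
base_acc = accuracy(uniform, sims, labels)
best_w, best_acc = random_search(sims, labels, uniform)
print(f"uniform weights: {base_acc:.3f}, searched weights: {best_acc:.3f}")
```

In a real setup, the rows of `sims` would come from a pre-trained model's image and text encoders, and the accuracy would be measured on held-out images rather than on the data used for the search.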

Taxonomy

Topics: Robotics and Automated Systems