Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors

Kento Kawaharazuka; Yoshiki Obinata; Naoaki Kanazawa; Kei Okada; Masayuki Inaba

arXiv:2303.05674·cs.RO·March 19, 2024·1 cites

Open Access

TL;DR

This paper reviews methods for applying pre-trained vision-language models to robotic recognition tasks, emphasizing flexible approaches that require no retraining, with the aim of improving robot understanding and broadening the range of applications.

Contribution

It categorizes and summarizes five methods to utilize vision-language models in robotics without retraining, enabling diverse recognition behaviors.

Findings

01 Effective methods for state recognition, object recognition, and more.

02 Enhanced flexibility and applicability of vision-language models in robotics.

03 Potential for new robotic applications using these methods.

Abstract

In recent years, a number of models that learn the relations between vision and language from large datasets have been released. These models perform a variety of tasks, such as answering questions about images, retrieving sentences that best correspond to images, and finding regions in images that correspond to phrases. Although there are some examples, the connection between these pre-trained vision-language models and robotics is still weak. If they are directly connected to robot motions, they lose their versatility due to the embodiment of the robot and the difficulty of data collection, and become inapplicable to a wide range of bodies and situations. Therefore, in this study, we categorize and summarize the methods to utilize the pre-trained vision-language models flexibly and easily in a way that the robot can understand, without directly connecting them to robot motions. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning