Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba

TL;DR
This paper reviews methods for applying pre-trained vision-language models to robotic recognition tasks, emphasizing flexible approaches that require no retraining, with the aim of improving robot understanding and broadening the range of applications.
Contribution
It categorizes and summarizes five methods for using vision-language models in robotics without retraining, enabling diverse recognition behaviors.
Findings
Effective methods for state recognition, object recognition, and more.
Enhanced flexibility and applicability of vision-language models in robotics.
Potential for new robotic applications using these methods.
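As a rough illustration of how state recognition can be framed without retraining, the sketch below treats it as image-to-text retrieval: the candidate sentence whose embedding is most similar to the image embedding is taken as the recognized state. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names `cosine` and `recognize_state`, the candidate sentences, and the three-dimensional vectors are all hypothetical; in practice the embeddings would come from a pre-trained vision-language model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize_state(image_vec, text_vecs):
    """Pick the candidate sentence whose embedding is closest to the image embedding.

    image_vec: embedding of the current camera image (assumed precomputed).
    text_vecs: dict mapping candidate state descriptions to their embeddings.
    Returns (best_sentence, scores_by_sentence).
    """
    scores = {text: cosine(image_vec, vec) for text, vec in text_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-made vectors standing in for real model embeddings (assumption).
image_vec = [0.9, 0.1, 0.2]
text_vecs = {
    "the door is open": [1.0, 0.0, 0.1],
    "the door is closed": [0.0, 1.0, 0.1],
}

best, scores = recognize_state(image_vec, text_vecs)
print(best)
```

Because the robot only consumes the selected sentence (or the similarity scores), the same pattern can be reused for other binary or multi-class states by changing the candidate sentences, with no change to the underlying model.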
Abstract
In recent years, a number of models that learn the relations between vision and language from large datasets have been released. These models perform a variety of tasks, such as answering questions about images, retrieving sentences that best correspond to images, and finding regions in images that correspond to phrases. Although there are some examples, the connection between these pre-trained vision-language models and robotics is still weak. If they are directly connected to robot motions, they lose their versatility due to the embodiment of the robot and the difficulty of data collection, and become inapplicable to a wide range of bodies and situations. Therefore, in this study, we categorize and summarize the methods to utilize the pre-trained vision-language models flexibly and easily in a way that the robot can understand, without directly connecting them to robot motions. We…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
