Proprioception Enhances Vision Language Model in Generating Captions and Subtask Segmentations for Robot Task
Kanata Suzuki, Shota Shimizu, Tetsuya Ogata

TL;DR
This paper investigates how incorporating low-level robot motion data into Vision Language Models improves their ability to generate accurate robot task captions and segment subtasks, enhancing robot imitation learning.
Contribution
It introduces a method that integrates robot motion data into VLMs for improved captioning and subtask segmentation in robotic tasks, addressing the challenge of understanding low-level motion.
Findings
Motion data improves captioning accuracy
Enhanced subtask segmentation performance
Validated through simulator experiments
Abstract
From the perspective of future developments in robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand robot motion. In particular, since the training datasets of Vision Language Models (VLMs) do not include low-level motion information from robots, video understanding that involves trajectory information remains a significant challenge. In this study, we assess two capabilities of VLMs through a video captioning task with low-level robot motion information: (1) automatic captioning of robot tasks and (2) segmentation of a series of tasks. Both capabilities are expected to enhance the efficiency of robot imitation learning by linking language and motion, and to serve as a measure of the foundation model's performance. The proposed method generates multiple "scene" captions using image captions and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
