Proprioception Enhances Vision Language Model in Generating Captions and Subtask Segmentations for Robot Task

Kanata Suzuki; Shota Shimizu; Tetsuya Ogata

arXiv:2512.20876 · cs.RO · January 13, 2026

TL;DR

This paper investigates how incorporating low-level robot motion data into Vision Language Models improves their ability to generate accurate robot task captions and segment subtasks, enhancing robot imitation learning.

Contribution

The paper introduces a method that integrates robot motion data into VLMs to improve captioning and subtask segmentation for robot tasks, addressing the difficulty VLMs have in understanding low-level motion.

Findings

01 Motion data improves captioning accuracy

02 Enhanced subtask segmentation performance

03 Validated through simulator experiments

Abstract

For the future development of robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand robot motion. In particular, because the training datasets of Vision Language Models (VLMs) do not include low-level motion information from robots, video understanding that includes trajectory information remains a significant challenge. In this study, we assess two capabilities of VLMs through a video captioning task with low-level robot motion information: (1) automatic captioning of robot tasks and (2) segmentation of a series of tasks. Both capabilities are expected to enhance the efficiency of robot imitation learning by linking language and motion, and to serve as a measure of the foundation model's performance. The proposed method generates multiple "scene" captions using image captions and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI