Thinker: A vision-language foundation model for embodied intelligence

Baiyu Pan; Daqin Luo; Junpeng Yang; Jiyuan Wang; Yixuan Zhang; Hailin Shi; Jichao Jiao

arXiv:2601.21199·cs.CV·January 30, 2026

Open Access

TL;DR

Thinker is a large vision-language model designed for embodied intelligence in robotics, addressing perspective confusion and temporal reasoning errors through a new dataset and a joint video comprehension approach, achieving state-of-the-art results.

Contribution

The paper introduces a novel vision-language foundation model, Thinker, with a tailored dataset and a joint frame-video input method for improved robotic perception and reasoning.

Findings

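01. Achieves state-of-the-art results on task planning benchmarks.

02. Jointly feeds key frames and full video sequences to the model to improve video comprehension.

03. Targets two failure modes: confusion between first-person and third-person perspectives, and overlooking information at the end of a video during temporal reasoning.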

Abstract

When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model…
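The abstract does not say how key frames are chosen or how the two inputs are combined, so the following is only a minimal sketch of the general idea of pairing a sampled full video with separately supplied key frames. The function names (`uniform_indices`, `build_joint_input`), the uniform sampling strategy, and the choice of the final frame as a key frame are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Sequence


def uniform_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced frame indices from [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-width bins.
    return [int(step * i + step / 2) for i in range(num_samples)]


def build_joint_input(
    num_frames: int, num_samples: int, key_frames: Sequence[int]
) -> Dict[str, List[int]]:
    """Return frame indices for two streams: the sampled video and the key frames.

    Hypothetical helper: out-of-range key frames are dropped and duplicates
    are merged, so both streams only reference frames that exist.
    """
    valid_keys = sorted({k for k in key_frames if 0 <= k < num_frames})
    return {
        "video": uniform_indices(num_frames, num_samples),
        "key_frames": valid_keys,
    }


# Assumed example: a 100-frame clip, 4 sampled frames, last frame as key frame.
joint = build_joint_input(num_frames=100, num_samples=4, key_frames=[99])
print(joint)
```

In this sketch the key frames travel as a separate stream rather than being merged into the sampled video, so a downstream model could treat them differently. Whether Thinker does so is not stated in the abstract.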

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition