Thinker: A vision-language foundation model for embodied intelligence
Baiyu Pan, Daqin Luo, Junpeng Yang, Jiyuan Wang, Yixuan Zhang, Hailin Shi, Jichao Jiao

TL;DR
Thinker is a large vision-language model for embodied intelligence in robotics. It addresses perspective confusion and temporal reasoning errors with a new dataset and a joint key-frame and full-video input approach, achieving state-of-the-art results on task planning benchmarks.
Contribution
The paper introduces Thinker, a vision-language foundation model, together with a tailored dataset and a joint key-frame and full-video input method for improved robotic perception and reasoning.
Findings
Achieves state-of-the-art results on task planning benchmarks.
Effectively incorporates key frames and full videos for better comprehension.
Addresses perspective confusion and temporal reasoning errors.
Abstract
When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model…
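Illustrative sketch
The abstract describes feeding key frames and the full video sequence to the model together, but this summary does not say how key frames are chosen or how the two inputs are combined. The Python sketch below is only one plausible reading, under stated assumptions: the full video is represented by evenly spaced frames that always include the final frame (since the abstract notes that models tend to overlook video endings), and the key-frame indices are supplied by the caller. The names `uniform_indices` and `build_joint_input` and the output dictionary layout are hypothetical and not taken from the paper.

```python
from typing import Any, Dict, List, Sequence


def uniform_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return evenly spaced frame indices that always include the last frame."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        # With a single sample, keep the ending of the clip.
        return [num_frames - 1]
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def build_joint_input(
    frames: Sequence[Any],
    key_frame_indices: Sequence[int],
    num_samples: int = 8,
) -> Dict[str, List[Any]]:
    """Pair a uniformly sampled clip with separately supplied key frames.

    Assumption: key frames are given by index; how the paper selects
    them is not stated in this summary.
    """
    sampled = [frames[i] for i in uniform_indices(len(frames), num_samples)]
    # Drop duplicate and out-of-range key-frame indices, keep temporal order.
    valid_keys = sorted({i for i in key_frame_indices if 0 <= i < len(frames)})
    keys = [frames[i] for i in valid_keys]
    return {"video": sampled, "key_frames": keys}


# Usage: integers stand in for decoded frames.
frames = list(range(10))
joint = build_joint_input(frames, key_frame_indices=[9, 9, 20], num_samples=4)
```

Keeping the two inputs in separate fields leaves the choice of how to interleave or tokenize them to the model-side code, which the summary does not describe.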
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition
