PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

Xiaopeng Lin; Shijie Lian; Bin Yu; Ruoqi Yang; Zhaolong Shen; Changti Wu; Yuzhuo Miao; Yurun Jin; Yukun Shi; Jiyan He; Cong Huang; Bojun Cheng; Kai Chen

arXiv:2512.16793 · cs.RO · February 5, 2026

TL;DR

This paper introduces PhysBrain, a model trained on a large dataset of human egocentric videos that have been converted into robot-relevant supervision. This training improves physical reasoning and planning for robots by narrowing the gap between human egocentric perception and robot embodiment.

Contribution

The authors propose an Egocentric2Embodiment pipeline to convert human egocentric videos into robot-relevant supervision, enabling scalable training of PhysBrain for improved physical intelligence in robots.

Findings

01 PhysBrain significantly improves egocentric understanding and planning.

02 It improves sample efficiency in VLA fine-tuning.

03 It achieves higher success rates in robot control tasks.

Abstract

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. Vision Language Models (VLMs) are essential to Vision-Language-Action (VLA) systems, but their reliance on third-person training data creates a viewpoint gap for humanoid robots. Collecting massive robot-centric data is an ideal but impractical solution due to cost and diversity constraints. Conversely, human egocentric videos offer a highly scalable data source with rich interaction context, yet the embodiment mismatch prevents their direct application. To bridge this gap, we propose an Egocentric2Embodiment Translation Pipeline that transforms raw human egocentric videos into multi-level, schema-driven embodiment supervision with enforced evidence grounding and temporal consistency, enabling the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning