TL;DR
PhysBrain 1.0 learns physical commonsense from large-scale human egocentric video and transfers it to robot policies, reporting state-of-the-art results on multimodal QA and embodied control benchmarks.
Contribution
It introduces a data engine that converts human egocentric video into structured question-answer supervision for training physical priors in vision-language models.
Findings
Achieves SOTA on multiple benchmarks including ERQA, PhysBench, and RoboCasa.
Demonstrates strong out-of-domain performance, especially on SimplerEnv.
Shows that scaling physical commonsense from videos improves robot action understanding.
Abstract
Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results…
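The data-engine step described in the abstract (structured annotations turned into question-answer supervision) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `ClipAnnotation` record, its field names, the `to_qa_pairs` helper, and the question templates are hypothetical and are not taken from the paper or its code. Only the four annotation types (scene elements, spatial dynamics, action execution, depth-aware relations) come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record. The fields mirror the four annotation
    types named in the abstract; this is not the paper's actual schema."""
    clip_id: str
    scene_elements: list[str] = field(default_factory=list)
    spatial_dynamics: list[str] = field(default_factory=list)
    action_execution: list[str] = field(default_factory=list)
    # (subject, relation, reference), e.g. ("cup", "in front of", "kettle")
    depth_relations: list[tuple[str, str, str]] = field(default_factory=list)


def to_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotated clip into QA records using illustrative templates."""
    pairs: list[dict[str, str]] = []

    def add(question: str, answer: str) -> None:
        pairs.append({"clip_id": ann.clip_id, "question": question, "answer": answer})

    if ann.scene_elements:
        add("Which objects are visible in the scene?", ", ".join(ann.scene_elements))
    if ann.spatial_dynamics:
        add("How do the objects move during the clip?", " ".join(ann.spatial_dynamics))
    if ann.action_execution:
        add("What steps does the person carry out?", " Then ".join(ann.action_execution))
    # One question per depth-aware relation.
    for subject, relation, reference in ann.depth_relations:
        add(
            f"Where is the {subject} relative to the {reference} in depth?",
            f"The {subject} is {relation} the {reference}.",
        )
    return pairs


if __name__ == "__main__":
    example = ClipAnnotation(
        clip_id="clip_0001",  # made-up identifier
        scene_elements=["cup", "kettle"],
        action_execution=["Grasp the kettle.", "Pour water into the cup."],
        depth_relations=[("cup", "in front of", "kettle")],
    )
    for record in to_qa_pairs(example):
        print(record["question"], "->", record["answer"])
```

Empty annotation types produce no questions in this sketch, so clips with partial annotations still yield usable records. A real pipeline would presumably generate questions with a model rather than fixed templates, which the abstract does not specify.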
