TL;DR
EgoVLA trains vision-language-action models on large-scale egocentric human videos, then converts the predicted human wrist and hand actions into robot actions through inverse kinematics and retargeting. After fine-tuning on a few robot demonstrations, it outperforms baselines on the new Ego Humanoid Manipulation Benchmark.
Contribution
The paper presents a novel approach to train VLA models from human videos, reducing reliance on robot data and enabling effective robot manipulation via action retargeting.
Findings
EgoVLA outperforms both specialist and generalist baselines on the Ego Humanoid Manipulation Benchmark.
Using human videos enhances scene and task diversity.
Fine-tuning with few demonstrations achieves effective robot policies.
Abstract
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and…
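The retargeting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the predicted wrist pose is a 4x4 homogeneous transform in the camera frame, that a fixed camera-to-robot-base transform is known, and that fingertip offsets are rescaled by a single hand-size ratio. The function names, the example transform, and the scale value are all hypothetical, and the inverse kinematics solve that would follow is omitted.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def retarget_wrist(T_robot_from_cam, T_cam_wrist):
    """Re-express a camera-frame wrist pose in the robot base frame (rigid transform)."""
    return T_robot_from_cam @ T_cam_wrist


def retarget_fingertips(fingertips_in_wrist, hand_scale):
    """Scale fingertip offsets (relative to the wrist) to the robot hand size.

    hand_scale is an assumed robot-to-human hand size ratio.
    """
    return hand_scale * np.asarray(fingertips_in_wrist, dtype=float)


# Hypothetical setup: camera rotated 90 degrees about z and offset 0.5 m along x
# relative to the robot base.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
T_robot_from_cam = make_transform(Rz, np.array([0.5, 0.0, 0.0]))

# Hypothetical predicted human wrist pose in the camera frame.
T_cam_wrist = make_transform(np.eye(3), np.array([0.1, 0.2, 0.3]))

T_robot_wrist = retarget_wrist(T_robot_from_cam, T_cam_wrist)
wrist_position = T_robot_wrist[:3, 3]

# One hypothetical fingertip, 10 cm from the wrist along z, with a 1.2x scale.
robot_fingertips = retarget_fingertips([[0.0, 0.0, 0.1]], hand_scale=1.2)
```

In a full pipeline, the resulting wrist pose and fingertip targets would be passed to an inverse kinematics solver for the robot arm and hand; that solver depends on the specific robot and is left out here.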
Peer Reviews
Decision·Submitted to ICLR 2026
* The paper addressed an interesting and important problem.
* The method proposed in this paper is intuitive, and the paper is easy to follow.
* The proposed benchmark, although limited, is valuable. Within this benchmark, the proposed method outperforms both specialist and generalist baselines and generalizes better across viewpoints and object positions.
* The main limitation is that validation is mostly in simulation. While the contributions may still be meaningful for scaling robot learning with accessible human video data, the current justification based only on simulation is not convincing. This is particularly important because the paper's argument is bold and could also be justified through real-world experiments. In other words, the claims made may hold only within this paper's experimental settings.
* While being int
- The model allows joint learning from robotic and non-robotic data, approaching a core problem of VLA training
- The proposed VLA is well engineered, matching human hand representations with dexterous robot manipulators at each training stage
- The model shows improved results over baseline methods on the proposed Ego Humanoid Manipulation Benchmark
- The method only utilizes simple human demonstrations with clearly visible hands in simple environments. No results on day-to-day demonstrations are presented, limiting the potential advantage of reduced dataset collection cost. Generalization to these scenarios could potentially be shown on datasets like Ego4D or Epic Kitchens.
- EgoVLA is evaluated purely on simulation results. With VLA models often showing considerably different performance in real-world deployment, this shows little e
Leveraging human videos to learn dexterous manipulation policies is a promising direction, particularly given the high cost and effort required to collect teleoperation data. Aligning the action space between humans and robots through rigid 3D transformations and retargeting is a sound and well-motivated design choice. Overall, the proposed method demonstrates strong performance based on the reported evaluations.
The full system is evaluated only in simulation, and the domain gap between simulation and the real world is not adequately addressed. Moreover, the method still requires task-specific robot data for fine-tuning, which limits its scalability and generalization potential.
