EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation
Zibin Dong, Fei Ni, Yifu Yuan, Yinchuan Li, Jianye Hao

TL;DR
EmbodiedMAE introduces a unified 3D multi-modal representation for robot manipulation that integrates RGB, depth, and point cloud data, improving training efficiency and final performance over existing vision foundation models.
Contribution
The paper develops EmbodiedMAE, a multi-modal masked autoencoder that learns representations across RGB, depth, and point cloud modalities, and enhances the DROID dataset with high-quality depth maps and point clouds (DROID-3D) for 3D embodied vision research.
Findings
Outperforms state-of-the-art vision foundation models in both training efficiency and final performance, in simulation and on real robots.
Exhibits strong scaling behavior with model size.
Enables effective policy learning from 3D inputs.
Abstract
We present EmbodiedMAE, a unified 3D multi-modal representation for robot manipulation. Current approaches suffer from significant domain gaps between training datasets and robot manipulation tasks, while also lacking model architectures that can effectively incorporate 3D information. To overcome these limitations, we enhance the DROID dataset with high-quality depth maps and point clouds, constructing DROID-3D as a valuable supplement for 3D embodied vision research. Then we develop EmbodiedMAE, a multi-modal masked autoencoder that simultaneously learns representations across RGB, depth, and point cloud modalities through stochastic masking and cross-modal fusion. Trained on DROID-3D, EmbodiedMAE consistently outperforms state-of-the-art vision foundation models (VFMs) in both training efficiency and final performance across 70 simulation tasks and 20 real-world robot manipulation…
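The stochastic masking across RGB, depth, and point cloud tokens mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Dirichlet draw splits a fixed visible-token budget across modalities (the scheme used in MultiMAE-style pre-training), and the function name `sample_visible_tokens`, the `alpha` parameter, and the token counts are hypothetical.

```python
import numpy as np


def sample_visible_tokens(tokens_per_modality, num_visible, alpha=1.0, rng=None):
    """Choose which tokens stay visible (unmasked) in each modality.

    Assumption: a Dirichlet draw decides each modality's share of the
    visible-token budget; tokens are then picked uniformly at random
    inside each modality. Everything not returned is treated as masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    names = list(tokens_per_modality)
    capacity = np.array([tokens_per_modality[n] for n in names], dtype=np.int64)
    if not 0 <= num_visible <= capacity.sum():
        raise ValueError("num_visible must lie between 0 and the total token count")

    # Random share of the budget per modality, then an integer allocation,
    # clipped so no modality is asked for more tokens than it has.
    shares = rng.dirichlet(np.full(len(names), alpha))
    budget = np.minimum(rng.multinomial(num_visible, shares), capacity)

    # Hand any clipped-off tokens to modalities that still have spare capacity.
    while budget.sum() < num_visible:
        spare = capacity - budget
        i = rng.choice(len(names), p=spare / spare.sum())
        budget[i] += 1

    # Sample visible token indices without replacement inside each modality.
    return {
        name: np.sort(rng.choice(int(cap), size=int(b), replace=False))
        for name, cap, b in zip(names, capacity, budget)
    }


# Hypothetical setup: 196 tokens per modality, 96 visible tokens in total.
visible = sample_visible_tokens(
    {"rgb": 196, "depth": 196, "point_cloud": 196},
    num_visible=96,
    rng=np.random.default_rng(0),
)
print({name: len(idx) for name, idx in visible.items()})
```

A small `alpha` tends to concentrate the visible tokens in one modality, which forces reconstruction of the others from it; a large `alpha` spreads them more evenly. Which regime the paper uses is not stated in the text above.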
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper is exceptionally well-written and clear. The methodology is easy to follow, from the data preparation for DROID-3D to the detailed explanation of the encoder, multi-modal decoder, and distillation process.
- It provides a robust and scalable method for effectively utilizing 3D inputs, which are crucial for precise manipulation but often degrade performance in prior VFM adaptation attempts.
- The DROID-3D dataset is a valuable public resource, and the demonstration of superior perform
- The core contribution is a unified 3D model, yet the real-world results show the Point Cloud (PC) policies (EmbodiedMAE-PC) significantly underperform the RGB-only and RGBD variants on the xArm platform. This contradicts the paper's goal of effective 3D fusion. The paper attributes this to sensor noise, but this suggests the DP3 encoder or the PC representation itself is not robust enough. A more thorough analysis or ablation comparing different PC encoders is necessary to validate the PC pipe
- The paper is clearly written and easy to follow, with technical concepts presented in a coherent and accessible manner.
- The experimental setup is well-structured and comprehensive, demonstrating careful design and thorough evaluation across diverse benchmarks.
- The analysis in Section 3.2 presents intriguing and insightful experimental designs that effectively validate the model’s cross-modal learning capabilities.
- The work makes a meaningful contribution to the open-source community by r
- The paper does not clearly justify the choice of MAE-based pre-training over alternative paradigms such as CLIP-style contrastive learning or DINO-style self-distillation. This decision is central to the method’s novelty, yet MAE is introduced abruptly (e.g., L48) without sufficient motivation or discussion of trade-offs. A deeper explanation of why MAE is particularly suitable for embodied 3D perception — and why contrastive or language-conditioned methods may be less effective — would signif
1. This work provides an end-to-end solution addressing both data scarcity (via DROID-3D) and architectural limitations (via EmbodiedMAE), offering substantial value to the research community.
2. The use of ZED SDK for generating metric depth maps and point clouds represents a significant improvement over commonly used estimated depth methods, providing temporally consistent 3D data that is crucial for robotic manipulation.
3. The combination of modality-agnostic stochastic masking (using Diri
1. While the paper demonstrates strong performance metrics, it lacks discussion of inference latency for the different EmbodiedMAE variants. For real-world robotic deployment where real-time performance is critical, understanding the latency characteristics on target hardware would help assess practical applicability.
2. The observation that point-cloud-only policies underperform RGB-only inputs is noted but not thoroughly analyzed. The paper would benefit from exploring whether alternative poi
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning
