TL;DR
SPA introduces a 3D spatial-aware representation learning framework that significantly improves embodied AI performance across diverse tasks by leveraging differentiable neural rendering and vision transformers.
Contribution
The paper presents SPA, a 3D spatial-awareness representation learning framework that uses neural rendering to improve embodied AI, and evaluates it on a comprehensive multi-task benchmark.
Findings
Outperforms 10+ state-of-the-art methods
Evaluated on 268 tasks across 8 simulators
Validated in real-world experiments
Abstract
In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios. These results…
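The abstract does not spell out the renderer, so the following is only a minimal sketch of generic emission-absorption volume rendering along a single ray, the kind of compositing step that differentiable neural rendering methods commonly build on. It is not taken from the paper. The function name `render_ray`, its arguments, and the sample values are hypothetical. Every operation here is differentiable, so in an autodiff framework a loss on the rendered color or depth could propagate gradients back to whatever network predicted the per-sample densities and colors.

```python
import math

def render_ray(densities, colors, depths):
    """Composite per-sample values along one ray (hypothetical helper).

    densities: non-negative volume densities, one per sample
    colors:    (r, g, b) tuples, one per sample
    depths:    increasing distances of the samples along the ray
    Returns the rendered color, the expected depth, and the per-sample weights.
    """
    n = len(densities)
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    weights = []
    for i in range(n):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < n:
            delta = depths[i + 1] - depths[i]
        elif n > 1:
            delta = depths[i] - depths[i - 1]
        else:
            delta = 1.0
        # Opacity contributed by this segment of the ray.
        alpha = 1.0 - math.exp(-densities[i] * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(3):
            rgb[c] += weight * colors[i][c]
        depth += weight * depths[i]
        # Light remaining for the samples behind this one.
        transmittance *= 1.0 - alpha
    return rgb, depth, weights


# Usage: empty space, then an effectively opaque surface at distance 2.0.
color, depth, weights = render_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    depths=[1.0, 2.0, 3.0],
)
```

In the usage example the second sample is opaque, so it receives all of the weight, and the rendered color and depth come from that sample alone. A sample hidden behind a surface gets no weight.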
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper conducts one of the most extensive evaluations in embodied AI representation learning to date, covering 268 tasks across 8 different simulators. This large-scale evaluation provides a thorough comparison with multiple state-of-the-art methods, showing a significant level of empirical rigor. 2. SPA uses neural rendering and multi-view images to enhance the 3D spatial awareness of the ViT, which is an effective way to give the model a better understanding of depth information and spati
1. SPA simply extends and adapts the ViT by adding neural rendering and 3D spatial features to enhance expressiveness, but the underlying architecture is still the existing ViT; in contrast, many new model architectures make more independent innovations at the algorithmic level. The paper lacks sufficiently groundbreaking innovation in model architecture. 2. The evaluation of SPA focuses primarily on imitation learning and does not fully explore reinforcement learning or other
Method addresses an important gap in current representation learning approaches by explicitly incorporating 3D spatial understanding through differentiable rendering. Comprehensive evaluation across a large number of tasks and simulators demonstrates the broad applicability of their approach. The self-supervised nature of their training signal (generated RGBD and semantic maps) is an interesting direction that reduces the need for expensive labeled data.
SPA's performance on LIBERO-spatial (Table 4) is somewhat counterintuitive. This seems like one of the more important benchmarks among the evaluation tasks, and given SPA's neural rendering pretraining objective, one would expect stronger results on spatial tasks. It also seems that AM-RADIO should be a baseline comparison in Table 3, given that its feature maps are used as supervision during pre-training.
1. The paper is very nicely written. 2. The architecture is well thought out and ties several components together effectively, which is not easy to achieve or to make work. 3. The evaluation benchmark has 268 tasks, which is quite extensive and a big improvement over previous benchmarks. 4. The thorough ablations (mask ratio importance, dataset impact, etc.) are very informative. 5. The results are quite good and show the potential of SPA.
1. Benchmark descriptions are not well written, and it is not clear what the tasks are supposed to be. 2. The tables do not have sufficient captions, and it is a bit difficult to understand the metrics from the tables themselves. 3. It is not clear from the tables which methods are adapted from the vision community to solve embodied AI tasks, which makes it difficult to assess the fairness of the comparison. 4. Real world task setting is missing the most common vision language tasks which might benefit from spati
Taxonomy
Topics: Human Pose and Action Recognition
Methods: Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
