D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement
Yixuan Wang, Mingtong Zhang, Zhuoran Li, Tarik Kelestemur, Katherine Driggs-Campbell, Jiajun Wu, Li Fei-Fei, Yunzhu Li

TL;DR
D$^3$Fields introduces a dynamic, semantic 3D representation that fuses visual features for flexible, zero-shot robotic rearrangement, outperforming existing methods in real and simulated environments.
Contribution
The paper presents D$^3$Fields, a novel implicit 3D descriptor that captures dynamics and semantics, enabling zero-shot generalization in robotic rearrangement tasks.
Findings
Effective in zero-shot rearrangement tasks
Outperforms state-of-the-art implicit 3D representations
Demonstrates robustness in real-world and simulation environments
Abstract
Scene representation is a crucial design choice in robotic manipulation systems. An ideal representation is expected to be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields: dynamic 3D descriptor fields. These fields are implicit 3D representations that take in 3D points and output semantic features and instance masks. They can also capture the dynamics of the underlying 3D environments. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from visual foundational models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we…
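The projection-and-interpolation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `views` dictionary layout (`feat`, `K`, `R`, `t`), the pinhole camera model, and the plain averaging over views in which a point lands inside the image are all assumptions made for illustration. In particular, the sketch does no depth-based occlusion handling and produces no instance masks.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project (N, 3) world points into a pinhole camera; returns pixel coords and depth."""
    cam = points @ R.T + t                       # world frame -> camera frame
    depth = cam[:, 2]
    safe = np.where(depth > 1e-6, depth, 1.0)    # avoid dividing by zero or negative depth
    uv = (cam @ K.T)[:, :2] / safe[:, None]      # perspective division
    return uv, depth

def bilinear_sample(feat, uv):
    """Bilinearly interpolate an (H, W, C) feature map at (N, 2) pixel coords (u = x, v = y)."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def fuse_descriptors(points, views):
    """Average interpolated features over the views where each point projects inside the image.

    `views` is a hypothetical list of dicts with keys "feat" (H, W, C), "K" (3, 3),
    "R" (3, 3) and "t" (3,), where R and t map world coordinates to camera coordinates.
    Returns (descriptors of shape (N, C), per-point count of contributing views).
    """
    n_points = points.shape[0]
    n_channels = views[0]["feat"].shape[2]
    acc = np.zeros((n_points, n_channels))
    count = np.zeros(n_points)
    for view in views:
        H, W, _ = view["feat"].shape
        uv, depth = project_points(points, view["K"], view["R"], view["t"])
        visible = (
            (depth > 1e-6)
            & (uv[:, 0] >= 0) & (uv[:, 0] <= W - 1)
            & (uv[:, 1] >= 0) & (uv[:, 1] <= H - 1)
        )
        sampled = bilinear_sample(view["feat"], uv)
        acc[visible] += sampled[visible]
        count[visible] += 1
    # Points seen by no view keep an all-zero descriptor.
    return acc / np.maximum(count, 1)[:, None], count
```

In practice the per-view feature maps would come from a pretrained visual model, and a point would typically be weighted or discarded per view according to whether it is occluded; the sketch leaves both out to keep the projection and interpolation logic visible.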
Taxonomy
Topics: Robotics and Sensor-Based Localization · Robot Manipulation and Learning · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer · self-DIstillation with NO labels
