TL;DR
LAMP uses image editing as a source of 3D priors, extracting inter-object 3D transformations from it to enable precise and generalizable open-world robotic manipulation.
Contribution
It introduces a method that lifts 2D image-editing cues into continuous, geometry-aware 3D representations (inter-object 3D transformations) for open-world manipulation tasks.
Findings
Achieves accurate 3D transformations in manipulation tasks.
Demonstrates strong zero-shot generalization in open-world scenarios.
Outperforms existing methods in fine-grained spatial reasoning.
Abstract
Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently…
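To make the term "inter-object 3D transformation" concrete, the sketch below expresses the pose of one object in the frame of another as a rigid transform. It is a generic illustration of the representation only, not LAMP's pipeline: the object names, poses, and helper functions (`rot_z`, `make_pose`, `invert_rigid`, `relative_transform`) are all assumptions made up for this example.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def make_pose(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Plain matrix product for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def invert_rigid(T):
    """Invert a rigid transform: rotation R^T, translation -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return make_pose(R_t, t_inv)

def relative_transform(T_world_a, T_world_b):
    """Pose of object b expressed in the frame of object a."""
    return matmul(invert_rigid(T_world_a), T_world_b)

# Hypothetical scene: a mug rotated 90 degrees about z at (1, 0, 0),
# and a spoon with no rotation at (1, 2, 0), both in world coordinates.
T_mug = make_pose(rot_z(math.pi / 2), [1.0, 0.0, 0.0])
T_spoon = make_pose(rot_z(0.0), [1.0, 2.0, 0.0])

# Continuous, geometry-aware relation between the two objects.
T_rel = relative_transform(T_mug, T_spoon)
```

Composing `T_mug` with `T_rel` recovers `T_spoon`, which is what makes such a relative transform usable as a manipulation target: given the current pose of a reference object, it specifies where the other object should end up.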
