Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories
Jaeyong Sung, Ian Lenz, Ashutosh Saxena

TL;DR
This paper presents a deep neural network that embeds point-clouds, natural language, and manipulation trajectories into a shared space, so that a robot can manipulate novel objects with higher accuracy and faster inference than previous methods.
Contribution
It introduces a deep learning approach that learns a joint embedding of point-cloud, language, and trajectory data, letting a robot reason across these modalities without manually designed features.
Findings
Significant accuracy improvements over previous methods.
Faster inference times in manipulation tasks.
Successful real-world robot experiments.
Abstract
A robot operating in a real-world environment needs to perform reasoning over a variety of sensor modalities such as vision, language and motion trajectories. However, it is extremely challenging to manually design features relating such disparate modalities. In this work, we introduce an algorithm that learns to embed point-cloud, natural language, and manipulation trajectory data into a shared embedding space with a deep neural network. To learn semantically meaningful spaces throughout our network, we use a loss-based margin to bring embeddings of relevant pairs closer together while driving less-relevant cases from different modalities further apart. We use this both to pre-train its lower layers and fine-tune our final embedding space, leading to a more robust representation. We test our algorithm on the task of manipulating novel objects and appliances based on prior experience…
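The loss-based margin described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `loss_augmented_margin_loss`, the dot-product similarity, the scaling factor `alpha`, and the choice to penalize only the most violating candidate are all assumptions made for illustration. The abstract states only that a loss-based margin pulls relevant pairs together and pushes less-relevant cases further apart.

```python
import numpy as np

def loss_augmented_margin_loss(anchor, candidates, deltas, true_idx, alpha=1.0):
    """Hinge-style loss whose margin grows with a task loss (illustrative only).

    anchor     : (d,) embedding of a point-cloud/language pair (hypothetical input)
    candidates : (n, d) embeddings of candidate trajectories
    deltas     : (n,) non-negative dissimilarity of each candidate to the
                 relevant trajectory (0 for the relevant one)
    true_idx   : index of the relevant trajectory in `candidates`
    alpha      : assumed scaling factor for the margin
    """
    anchor = np.asarray(anchor, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Similarity of every candidate to the anchor (dot product is an assumption).
    sims = candidates @ anchor

    # A candidate violates the margin when its similarity plus its scaled
    # task loss exceeds the similarity of the relevant trajectory.
    violations = alpha * deltas + sims - sims[true_idx]
    violations[true_idx] = 0.0  # the relevant trajectory never violates itself

    # Penalize only the most violating candidate, clipped at zero.
    return float(max(0.0, violations.max()))


# Relevant trajectory already scores highest by a wide enough margin: no loss.
no_violation = loss_augmented_margin_loss(
    anchor=[1.0, 0.0],
    candidates=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]],
    deltas=[0.0, 1.0, 0.25],
    true_idx=0,
)

# A dissimilar trajectory scores higher than the relevant one: positive loss.
violation = loss_augmented_margin_loss(
    anchor=[1.0, 0.0],
    candidates=[[0.5, 0.0], [1.0, 0.0]],
    deltas=[0.0, 1.0],
    true_idx=0,
)
```

In this sketch, a candidate that differs more from the relevant trajectory (larger `deltas` entry) must be pushed further from the anchor before its contribution to the loss reaches zero, which is one way to read "driving less-relevant cases further apart".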
