Towards Fusing Point Cloud and Visual Representations for Imitation Learning
Atalay Donat, Xiaogang Jia, Xi Huang, Aleksandar Taranovic, Denis Blessing, Ge Li, Hongyi Zhou, Hanyi Zhang, Rudolf Lioutikov, Gerhard Neumann

TL;DR
FPV-Net is a novel imitation learning approach that effectively fuses point cloud and RGB image data using adaptive layer norm conditioning, leading to state-of-the-art results on manipulation tasks.
Contribution
The paper introduces FPV-Net, a new method that combines point cloud and RGB data more effectively than previous approaches by leveraging adaptive layer norm conditioning.
Findings
Outperforms existing methods on the RoboCasa benchmark
Effectively combines geometric and semantic information
Achieves state-of-the-art performance across tasks
Abstract
Learning for manipulation requires using policies that have access to rich sensory information such as point clouds or RGB images. Point clouds efficiently capture geometric structures, making them essential for manipulation tasks in imitation learning. In contrast, RGB images provide rich texture and semantic information that can be crucial for certain tasks. Existing approaches for fusing both modalities assign 2D image features to point clouds. However, such approaches often lose global contextual information from the original images. In this work, we propose FPV-Net, a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning, leveraging the beneficial properties of both modalities. Through extensive experiments…
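The conditioning idea named in the abstract can be illustrated with a minimal NumPy sketch of adaptive layer norm. This is not the FPV-Net implementation: the mean-pooling of image tokens into a single vector, the `1 + scale` modulation form, the projection weights `w_scale` and `w_shift`, and all shapes are assumptions chosen for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln(point_tokens, cond, w_scale, w_shift):
    """Adaptive layer norm: scale and shift are predicted from `cond`.

    point_tokens: (num_tokens, dim) features inside a point-cloud encoder
    cond:         (cond_dim,) conditioning vector derived from image tokens
    w_scale, w_shift: (cond_dim, dim) hypothetical projection weights
    """
    scale = cond @ w_scale  # (dim,)
    shift = cond @ w_shift  # (dim,)
    return layer_norm(point_tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
dim, cond_dim = 8, 6
point_tokens = rng.normal(size=(16, dim))
# Stand-in for the global and local image tokens mentioned in the abstract.
image_tokens = rng.normal(size=(5, cond_dim))
# Assumption: pool the image tokens into one conditioning vector.
cond = image_tokens.mean(axis=0)

# With zero projections the modulation vanishes and AdaLN is plain layer norm.
w_zero = np.zeros((cond_dim, dim))
out_identity = adaln(point_tokens, cond, w_zero, w_zero)

# With non-zero projections the image features modulate the point features.
w_scale = rng.normal(scale=0.1, size=(cond_dim, dim))
w_shift = rng.normal(scale=0.1, size=(cond_dim, dim))
out = adaln(point_tokens, cond, w_scale, w_shift)
```

In this sketch the image information never replaces the point features; it only rescales and shifts their normalized values, which is one way a whole-image summary can influence a geometric encoder without being projected onto individual points.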
Taxonomy
Topics: 3D Shape Modeling and Analysis · Human Pose and Action Recognition · Human Motion and Animation
