Towards Fusing Point Cloud and Visual Representations for Imitation Learning

Atalay Donat; Xiaogang Jia; Xi Huang; Aleksandar Taranovic; Denis Blessing; Ge Li; Hongyi Zhou; Hanyi Zhang; Rudolf Lioutikov; Gerhard Neumann

arXiv:2502.12320·cs.RO·February 20, 2025

TL;DR

FPV-Net is a novel imitation learning approach that effectively fuses point cloud and RGB image data using adaptive layer norm conditioning, leading to state-of-the-art results on manipulation tasks.

Contribution

The paper introduces FPV-Net, a new method that combines point cloud and RGB data more effectively than previous approaches by leveraging adaptive layer norm conditioning.

Findings

01. Outperforms existing methods on RoboCasa benchmark
02. Effectively combines geometric and semantic information
03. Achieves state-of-the-art performance across tasks

Abstract

Learning for manipulation requires using policies that have access to rich sensory information such as point clouds or RGB images. Point clouds efficiently capture geometric structures, making them essential for manipulation tasks in imitation learning. In contrast, RGB images provide rich texture and semantic information that can be crucial for certain tasks. Existing approaches for fusing both modalities assign 2D image features to point clouds. However, such approaches often lose global contextual information from the original images. In this work, we propose FPV-Net, a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning, leveraging the beneficial properties of both modalities. Through extensive experiments…
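The conditioning mechanism named in the abstract can be illustrated with a minimal sketch of adaptive layer norm. This is not the FPV-Net implementation: the abstract does not give the exact formulation, so the `(1 + scale)` modulation, the function names, the weight shapes, and the toy dimensions below are all assumptions chosen for illustration. The idea shown is that a point-cloud token is normalized, then scaled and shifted by values predicted from an image token.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def ada_layer_norm(point_token, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive layer norm (assumed formulation).

    `point_token` is a point-cloud feature vector; `cond` is a conditioning
    vector standing in for an image token. Scale and shift are predicted
    from `cond` instead of being fixed learned parameters.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(point_token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]

# Toy example with hypothetical sizes: 4-dim point token, 3-dim image token.
point_token = [1.0, 2.0, 3.0, 4.0]
image_token = [0.5, -1.0, 2.0]
w_scale = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [0.1, 0.1, 0.1]]
w_shift = [[0.0, 0.2, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, 0.2], [0.0, 0.0, 0.0]]
zeros = [0.0, 0.0, 0.0, 0.0]

conditioned = ada_layer_norm(point_token, image_token, w_scale, zeros, w_shift, zeros)
print(conditioned)
```

With all projection weights and biases set to zero, the function reduces to plain layer norm, so the image token only changes the point-cloud features to the extent the projections have learned to use it.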

Taxonomy

Topics: 3D Shape Modeling and Analysis · Human Pose and Action Recognition · Human Motion and Animation