PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning

Xiaogang Jia; Qian Wang; Anrui Wang; Han A. Wang; Balázs Gyenes; Emiliyan Gospodinov; Xinkai Jiang; Ge Li; Hongyi Zhou; Weiran Liao; Xi Huang; Maximilian Beck; Moritz Reuss; Rudolf Lioutikov; Gerhard Neumann

arXiv:2510.20406 · cs.RO · January 27, 2026

TL;DR

PointMapPolicy introduces a structured point cloud processing method that enhances multi-modal perception in robotic manipulation, achieving state-of-the-art results by fusing point clouds with RGB data using a novel structured grid approach.

Contribution

The paper presents PointMapPolicy, a new approach that conditions diffusion policies on structured point grids, enabling better geometric understanding and multi-modal fusion for robotic tasks.

Findings

01 Achieves state-of-the-art performance on RoboCasa and CALVIN benchmarks.

02 Effectively fuses point cloud and RGB data for improved manipulation.

03 Demonstrates robustness in real robot evaluations.

Abstract

Robotic manipulation systems benefit from complementary sensing modalities, each of which provides unique information about the environment. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Moreover, because the points are structured in a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point…
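To make the idea of a structured grid of points concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes the grid is obtained by back-projecting a depth image through a pinhole camera model, and the function names and intrinsics (fx, fy, cx, cy) are hypothetical choices for illustration. It shows that every pixel keeps its own 3D point (no downsampling), and that a rigid transform to another reference frame preserves the image-like layout.

```python
import numpy as np


def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth image into an (H, W, 3) grid of points.

    Assumption: pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); the abstract does not specify how the grid is built.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    # One 3D point per pixel, stored in the same layout as the image.
    return np.stack([x, y, depth], axis=-1)


def transform_point_map(point_map, T):
    """Apply a 4x4 homogeneous transform T to every point in the grid."""
    R, t = T[:3, :3], T[:3, 3]
    # Broadcasting keeps the (H, W, 3) shape intact.
    return point_map @ R.T + t


# Toy example: a flat surface 2 m in front of a 4x6 pixel camera.
depth = np.full((4, 6), 2.0)
pm = depth_to_point_map(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)

# Hypothetical camera-to-world transform: shift 1 m along z.
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 1.0]
pm_world = transform_point_map(pm, T)
```

Because the result has the shape of a three-channel image, it can in principle be passed to standard 2D vision encoders, which is the property the abstract highlights.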

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robot Manipulation and Learning · Robotics and Sensor-Based Localization