Point Bridge: 3D Representations for Cross Domain Policy Learning

Siddhant Haldar; Lars Johannsmeier; Lerrel Pinto; Abhishek Gupta; Dieter Fox; Yashraj Narang; Ajay Mandlekar

arXiv:2601.16212·cs.RO·March 26, 2026

Open Access · 3 Reviews

TL;DR

Point Bridge introduces a domain-agnostic point-based representation framework that enables zero-shot sim-to-real transfer of robotic manipulation policies trained solely on synthetic data, significantly reducing the need for real-world datasets.

Contribution

The paper presents a novel point-based representation approach combined with vision-language models and transformer policies for effective zero-shot sim-to-real transfer without explicit visual alignment.

Findings

01. Up to 44% improvement in zero-shot transfer performance.

02. Up to 66% gains with limited real data.

03. Outperforms prior vision-based co-training methods.

Abstract

Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Unified point representation: Mapping both sim and real to task-relevant points is pragmatic and deployment-friendly.
2. Empirical gains: The approach improves over image-based baselines in both zero-shot and limited co-train regimes.
3. Systematic ablations: The paper compares multiple depth/reconstruction sources and discusses viewpoint alignment, offering evidence for deployment trade-offs (success vs. frequency).
4. Implementation clarity: The synthetic-to-3D-to-policy pipeline is clearly…

Weaknesses

1. Limited novelty: Most components (data generation, object filtering, depth estimation, policy learning) are existing modules strung together; the main contribution is a well-engineered integration and representation choice rather than a new learning principle.
2. Baseline coverage (3D/depth): Comparisons are primarily against image-based policies. Missing are baselines that take dense point clouds/depth directly (e.g., point-cloud based, depth-only Diffusion/BC variants) under matched data, mak…

Reviewer 02 · Rating 0 · Confidence 4

Strengths

- The method shows good transfer from simulation training to real-world deployment.
- The proposed pipeline is well-engineered, utilizing powerful open-vocabulary models likely capable of generalization to broader scenarios.

Weaknesses

- The work has very limited novelty. Point cloud and point track representations have been used in numerous previous works (as cited by the authors), for specialist policies as well as for generalist VLA models. While these works do not explicitly target the sim2real problem, they show capable policies on simulation data, human demonstrations, and real robot demonstrations. In this context, human demonstration data in particular also represents a domain transfer problem.
- The work does not compare t…

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The central idea of a point-based representation is both powerful and elegant. It directly attacks the problem of visual domain gap by moving away from pixel-level inputs to a more geometric abstraction. This is a more scalable approach than striving for photorealistic simulation.
2. The framework presents a complete and highly automated pipeline. It intelligently integrates synthetic data generation, VLM-guided scene filtering, and modern policy learning into a cohesive system. The automation…

Weaknesses

1. POINT BRIDGE exhibits a strong dependence on external pre-trained vision models. The entire pipeline’s entry point relies on models like Gemini and SAM2. Consequently, the robustness of POINT BRIDGE is inherently tied to the performance of these components, and any failures in perception cannot be easily corrected within the framework itself.
2. The framework relies on assumptions about a calibrated scene with known camera intrinsics and extrinsics. This requirement for a consistent reference…
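
To make the calibration assumption raised by Reviewer 03 concrete, here is a minimal sketch of why known intrinsics and extrinsics matter for a point-based representation: 2D keypoints with depth can only be expressed in a shared 3D frame if both are available. This is a generic pinhole back-projection, not the paper's implementation; the function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the 4x4 `T_world_cam` matrix are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth to the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Point3,
                    T_world_cam: Sequence[Sequence[float]]) -> Point3:
    """Apply a 4x4 homogeneous camera-to-world transform (assumed extrinsics)."""
    x, y, z = p_cam
    wx, wy, wz = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in T_world_cam[:3])
    return (wx, wy, wz)


def lift_keypoints(pixels: Sequence[Tuple[float, float]],
                   depths: Sequence[float],
                   intrinsics: Tuple[float, float, float, float],
                   T_world_cam: Sequence[Sequence[float]]) -> List[Point3]:
    """Map 2D keypoints plus per-point depth to 3D points in a shared frame."""
    fx, fy, cx, cy = intrinsics
    return [
        camera_to_world(backproject_pixel(u, v, d, fx, fy, cx, cy), T_world_cam)
        for (u, v), d in zip(pixels, depths)
    ]


# Hypothetical calibration: identity rotation, camera offset 1 m along world x.
T = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
points = lift_keypoints([(320.0, 240.0), (420.0, 240.0)], [2.0, 2.0],
                        (500.0, 500.0, 320.0, 240.0), T)
```

An error in either the intrinsics or `T_world_cam` shifts every lifted point, which is one way to read the reviewer's concern about requiring a consistent reference frame.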

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics