Point Bridge: 3D Representations for Cross Domain Policy Learning
Siddhant Haldar, Lars Johannsmeier, Lerrel Pinto, Abhishek Gupta, Dieter Fox, Yashraj Narang, Ajay Mandlekar

TL;DR
Point Bridge introduces a domain-agnostic point-based representation framework that enables zero-shot sim-to-real transfer of robotic manipulation policies trained solely on synthetic data, significantly reducing the need for real-world datasets.
Contribution
The paper presents a novel point-based representation approach combined with vision-language models and transformer policies for effective zero-shot sim-to-real transfer without explicit visual alignment.
Findings
Up to 44% improvement in zero-shot transfer performance.
Up to 66% gains with limited real data.
Outperforms prior vision-based co-training methods.
Abstract
Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real…
Peer Reviews
Decision·Submitted to ICLR 2026
1. Unified point representation: Mapping both sim and real to task-relevant points is pragmatic and deployment-friendly. 2. Empirical gains: The approach improves over image-based baselines in both zero-shot and limited co-train regimes. 3. Systematic ablations: The paper compares multiple depth/reconstruction sources and discusses viewpoint alignment, offering evidence for deployment trade-offs (success vs. frequency). 4. Implementation clarity: The synthetic-to-3D-to-policy pipeline is clearly
1. Limited novelty: Most components (data generation, object filtering, depth estimation, policy learning) are existing modules strung together; the main contribution is a well-engineered integration and representation choice rather than a new learning principle. 2. Baseline coverage (3D/depth): Comparisons are primarily against image-based policies. Missing are baselines that take dense point clouds/depth directly (e.g., point-cloud based, depth-only Diffusion/BC variants) under matched data, mak
- The method shows good transfer from simulation training to real-world deployment. - The proposed pipeline is well-engineered, utilizing powerful open-vocabulary models likely capable of generalization to broader scenarios.
- The work has very limited novelty. Point cloud and point track representations have been used in numerous previous works (as cited by the authors), for specialist policies as well as for generalist VLA models. While these works do not explicitly target the sim2real problem, they show capable policies on simulation data, human demonstrations and real robot demonstrations. In this context, human demonstration data in particular also represents a domain transfer problem. - The work does not compare t
1. The central idea of a point-based representation is both powerful and elegant. It directly attacks the problem of visual domain gap by moving away from pixel-level inputs to a more geometric abstraction. This is a more scalable approach than striving for photorealistic simulation. 2. The framework presents a complete and highly automated pipeline. It intelligently integrates synthetic data generation, VLM-guided scene filtering, and modern policy learning into a cohesive system. The automation
1. POINT BRIDGE exhibits a strong dependence on external pre-trained vision models. The entire pipeline’s entry point relies on models like Gemini and SAM2. Consequently, the robustness of POINT BRIDGE is inherently tied to the performance of these components, and any failures in perception cannot be easily corrected within the framework itself. 2. The framework relies on assumptions about a calibrated scene with known camera intrinsics and extrinsics. This requirement for a consistent reference
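The reviews above refer to depth sources and to the assumption of known camera intrinsics and extrinsics. As a rough illustration of why calibration matters for a point-based representation, here is a minimal sketch of lifting 2D pixels with metric depth into 3D points in a shared frame. It assumes a standard pinhole camera model; the function name `backproject_pixels` and all numeric values are hypothetical and are not taken from the paper, whose actual extraction pipeline may differ.

```python
import numpy as np

def backproject_pixels(pixels_uv, depths, K, T_world_cam):
    """Lift (u, v) pixels with metric depth into 3D points in a world frame.

    Assumptions (illustrative, not from the paper):
      - K is a 3x3 pinhole intrinsics matrix,
      - T_world_cam is a 4x4 camera-to-world extrinsics transform,
      - depths are measured along the camera z-axis, in metres.
    """
    pixels_uv = np.asarray(pixels_uv, dtype=float)
    depths = np.asarray(depths, dtype=float)
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pinhole back-projection into the camera frame.
    x = (pixels_uv[:, 0] - cx) / fx * depths
    y = (pixels_uv[:, 1] - cy) / fy * depths

    # Homogeneous coordinates, shape (N, 4), then map to the world frame.
    pts_cam = np.stack([x, y, depths, np.ones_like(depths)], axis=1)
    pts_world = (T_world_cam @ pts_cam.T).T
    return pts_world[:, :3]

# Hypothetical calibration: 640x480 camera, shifted 0.1 m along world x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_world_cam = np.eye(4)
T_world_cam[0, 3] = 0.1

points = backproject_pixels([[320, 240], [420, 240]], [1.0, 2.0], K, T_world_cam)
```

If simulated and real cameras are both expressed in the same world frame this way, the resulting points are comparable across domains regardless of image appearance, which is the intuition behind the calibration requirement the reviewer raises.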
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
