GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks
Davide Buoso, Andrea Protopapa, Stefano Di Carlo, Francesca Pistilli, Giuseppe Averta

TL;DR
GAP introduces a pre-training method that regularizes visual representations to produce stable, geometry-aware keypoints, significantly improving robotic manipulation learning when only a few demonstrations are available.
Contribution
The paper proposes Geometric Anchor Pre-training (GAP), a lightweight, action-free pre-training stage that enhances geometric grounding in visual representations for manipulation tasks.
Findings
GAP outperforms fine-tuning and attention-based poolers in data-scarce scenarios.
GAP achieves 62% success on RoboMimic Can with 15 demonstrations.
Proxy pre-training is lightweight, decoupled, and reusable across tasks.
Abstract
Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when fine-tuned on only a few samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the…
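To make the idea of a "small spatial pooling module" on top of frozen features more concrete, here is a minimal sketch of one common design, a spatial-softmax keypoint pooler. This is an illustrative assumption, not a description of GAP's actual adapter, which the abstract excerpt above does not specify. The function name `spatial_softmax_keypoints`, the `temperature` parameter, and the synthetic feature map standing in for frozen VFM output are all hypothetical.

```python
import numpy as np

def spatial_softmax_keypoints(features, temperature=1.0):
    """Reduce a (C, H, W) feature map to C keypoints of shape (C, 2).

    Each channel is turned into a probability map over pixels, and the
    keypoint is the expected (x, y) location under that map, expressed
    in normalized coordinates in [-1, 1].
    NOTE: hypothetical pooler for illustration; not the GAP adapter.
    """
    c, h, w = features.shape
    logits = features.reshape(c, h * w) / temperature
    # Subtract the per-channel max for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Pixel coordinate grid, flattened in the same row-major order.
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij"
    )
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (H*W, 2)
    return weights @ grid  # (C, 2)

# Synthetic stand-in for a frozen backbone's feature map (assumption).
feats = np.zeros((2, 5, 5), dtype=np.float64)
feats[0, 0, 4] = 50.0  # sharp peak at top-right pixel of channel 0
keypoints = spatial_softmax_keypoints(feats)
print(keypoints)
```

In this sketch, a channel with a single sharp activation yields a keypoint at that pixel, while a flat channel yields the image center. A pooler of this kind has few parameters but can still drift toward task-irrelevant image regions when trained on few demonstrations, which is the failure mode the abstract attributes to the spatial adapter.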
