ReMoBot: Retrieval-Based Few-Shot Imitation Learning for Mobile Manipulation with Vision Foundation Models

Yuying Zhang; Wenyan Yang; Francesco Verdoja; Ville Kyrki; Joni Pajarinen

arXiv:2408.15919 · cs.RO · September 19, 2025

TL;DR

ReMoBot is a retrieval-based few-shot imitation learning method that uses vision foundation models to let mobile robots perform manipulation tasks from a limited number of demonstrations. It shows strong real-world performance and generalization.

Contribution

The paper presents ReMoBot, a novel retrieval-based few-shot imitation learning approach that leverages vision foundation models for mobile manipulation tasks, addressing data scarcity and partial observability issues.

Findings

01. ReMoBot achieves a success rate of over 70% on benchmark tasks with only 20 demonstrations.

02. ReMoBot outperforms baseline methods in real-world experiments without sim-to-real transfer.

03. The approach generalizes across different robot positions, object sizes, and materials.

Abstract

Imitation learning (IL) algorithms typically distill experience into parametric behavior policies to mimic expert demonstrations. However, with limited demonstrations, existing methods often struggle to generate accurate actions, particularly under partial observability. To address this problem, we introduce a few-shot IL approach, ReMoBot, which directly retrieves information from demonstrations to solve Mobile manipulation tasks with ego-centric visual observations. Given the current observation, ReMoBot utilizes vision foundation models to identify relevant demonstrations, considering visual similarity w.r.t. both individual observations and history trajectories. A motion selection policy then selects the proper command for the robot until the task is successfully completed. The performance of ReMoBot is evaluated on three mobile manipulation tasks with a Boston Dynamics Spot robot…
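
The retrieval idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each observation has already been converted to an embedding vector by some vision foundation model, and the function names (`cosine`, `history_similarity`, `retrieve_action`), the weighting `w_hist`, the history `window`, the top-`k` majority vote, and the toy action labels are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def history_similarity(query_hist, demo_hist):
    """Mean cosine similarity over the most recent aligned frames."""
    n = min(len(query_hist), len(demo_hist))
    if n == 0:
        return 0.0
    pairs = zip(query_hist[-n:], demo_hist[-n:])
    return sum(cosine(q, d) for q, d in pairs) / n


def retrieve_action(query_hist, demos, k=3, w_hist=0.5, window=3):
    """Pick an action by retrieving similar demonstration steps.

    query_hist: list of embeddings; the last one is the current observation.
    demos: list of demonstrations, each a list of (embedding, action) pairs.
    All parameters are illustrative assumptions, not values from the paper.
    """
    scored = []
    for demo in demos:
        for t, (emb, action) in enumerate(demo):
            # Similarity of the current observation to this demo frame.
            obs_sim = cosine(query_hist[-1], emb)
            # Similarity of the recent history to the demo frames up to t.
            demo_hist = [e for e, _ in demo[max(0, t - window + 1): t + 1]]
            hist_sim = history_similarity(query_hist[-window:], demo_hist)
            score = (1 - w_hist) * obs_sim + w_hist * hist_sim
            scored.append((score, action))
    # Keep the k best-matching steps and take a majority vote on the action.
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = Counter(action for _, action in top)
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings standing in for foundation-model features.
demo_a = [([1.0, 0.0], "forward"), ([0.0, 1.0], "grasp")]
demo_b = [([1.0, 0.1], "forward"), ([0.1, 1.0], "grasp")]
chosen = retrieve_action([[1.0, 0.0], [0.05, 1.0]], [demo_a, demo_b], k=2)
```

In this toy setup the query history resembles the second frame of both demonstrations, so the vote selects the action stored with those frames. Combining a per-frame score with a history score is one simple way to disambiguate visually similar frames that occur at different stages of a task, which is the concern the abstract raises about partial observability.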

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Human Motion and Animation