FMimic: Foundation Models are Fine-grained Action Learners from Human Videos

Guangyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Zicai Peng, Mengxiao Hu, Tianxing Zhou, Mengyin Fu, Yi Yang, and Yufeng Yue

arXiv:2507.20622 · cs.RO · July 29, 2025

TL;DR

FMimic leverages foundation models to enable robotic systems to learn fine-grained actions directly from limited human videos, significantly improving performance in various manipulation tasks.

Contribution

This work introduces FMimic, a novel approach that uses foundation models for fine-grained action learning from minimal human video data, surpassing existing high-level plan-based methods.

Findings

01 Strong performance with just one human video

02 Outperforms other methods with five videos

03 Achieves over 39% improvement in multi-task RLBench experiments

Abstract

Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advancements in foundation models, particularly Vision Language Models (VLMs), have demonstrated remarkable capabilities in visual and linguistic reasoning for VIL tasks. Despite this progress, existing approaches primarily utilize these models for learning high-level plans from human demonstrations, relying on pre-defined motion primitives for executing physical interactions, which remains a major bottleneck for robotic systems. In this work, we present FMimic, a novel paradigm that harnesses foundation models to directly learn generalizable skills at even fine-grained action levels, using only a limited number of human videos. Extensive experiments demonstrate that our FMimic delivers strong performance with a single human video, and significantly…
