mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Jonas Pai; Liam Achenbach; Victoriano Montesinos; Benedek Forrai; Oier Mees; Elvis Nava

arXiv:2512.15692 · cs.RO · December 22, 2025

Open Access

TL;DR

mimic-video introduces a video-based action model for robot control that captures physical dynamics during video pretraining, improving sample efficiency by 10x and convergence speed by 2x over traditional vision-language-action models (VLAs).

Contribution

The paper proposes mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained video model with an inverse dynamics decoder, giving the policy a stronger physical prior for robotic manipulation.
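
To make the term "inverse dynamics decoder" concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the linear dynamics in `step`, the timestep `DT`, the scalar-gain decoder, and all function names are assumptions chosen for illustration. It only shows the general idea of inverse dynamics, which is recovering the action that connects two consecutive states (here, 2-D stand-ins for video latents).

```python
import random

DT = 0.1  # hypothetical control timestep for the toy dynamics


def step(z, a):
    """Toy forward dynamics (an assumption): z_next = z + DT * a."""
    return [zi + DT * ai for zi, ai in zip(z, a)]


def fit_gain(pairs, actions):
    """Closed-form least squares for a scalar gain k in a ~ k * (z_next - z)."""
    num = den = 0.0
    for (z, z_next), a in zip(pairs, actions):
        for zi, zni, ai in zip(z, z_next, a):
            d = zni - zi  # change in latent between consecutive steps
            num += d * ai
            den += d * d
    return num / den


def decode_action(k, z, z_next):
    """Inverse dynamics: infer the action from two consecutive latents."""
    return [k * (zni - zi) for zi, zni in zip(z, z_next)]


# Synthetic data: random latents and actions, rolled forward one step.
rng = random.Random(0)
latents = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(50)]
actions = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(50)]
pairs = [(z, step(z, a)) for z, a in zip(latents, actions)]

gain = fit_gain(pairs, actions)          # should be close to 1 / DT
recovered = decode_action(gain, *pairs[0])
```

In this toy setting the fitted gain approaches `1 / DT` and the decoded action matches the one that generated the transition. A real decoder would replace the scalar gain with a learned network and the hand-written latents with features from a pretrained video model.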

Findings

01 Achieves state-of-the-art results on robotic manipulation tasks

02 Improves sample efficiency by 10x

03 Converges 2x faster

Abstract

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis