From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
Linfang Zheng, Zikai Ouyang, Chen Wang, Jia Pan, and Wei Zhang

TL;DR
This survey reviews methods for learning robotic manipulation control interfaces from unannotated video data, categorizing approaches and analyzing their properties and challenges.
Contribution
It introduces an interface-centric taxonomy for video-to-control methods and analyzes control integration challenges in robotic manipulation from visual data.
Findings
Identifies three families of methods: direct video-action policies, latent-action methods, and explicit visual interfaces.
Analyzes how each method family closes the control loop and verifies predictions.
Highlights the robotics integration layer, which connects video predictions to robot actions, as the main challenge.
Abstract
Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction, all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an interface-centric taxonomy organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video-action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit…
