Learning Skills from Action-Free Videos
Hung-Chieh Fang, Kuo-Han Hung, Chu-Rong Chen, Po-Jung Chou, Chun-Kai Yang, Po-Chen Ko, Yu-Chiang Wang, Yueh-Hua Wu, Min-Hung Chen, Shao-Hua Sun

TL;DR
This paper introduces SOF, a framework that learns high-level skills from action-free videos using optical flow, enabling better planning and execution of complex robot actions from visual data.
Contribution
It proposes a novel flow-based latent skill space that bridges the gap between video prediction and action translation, facilitating high-level planning from raw videos.
Findings
Improves multitask learning performance
Enables long-horizon skill composition
Learns directly from raw visual data
Abstract
Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over…
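To make the idea of a discrete, flow-based token space more concrete, here is a minimal sketch of quantizing an optical-flow field against a codebook. It is not the authors' implementation: this summary does not describe SOF's encoder or training objective, so the sketch swaps in a plain nearest-neighbour codebook lookup (VQ-style) on raw flow patches. The function name `quantize_flow`, the patch size, and the random codebook are all hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize_flow(flow, codebook, patch=4):
    """Map a dense flow field (H, W, 2) to a grid of discrete token ids.

    Assumption: tokens come from nearest-neighbour lookup on raw flow patches.
    A learned encoder would normally sit in front of this step.
    """
    h, w, c = flow.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    # Cut the field into non-overlapping patches and flatten each one.
    patches = flow.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # Squared Euclidean distance from every patch to every codebook entry.
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Each patch is replaced by the index of its closest codebook entry.
    return d2.argmin(axis=1).reshape(h // patch, w // patch)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4 * 4 * 2))   # hypothetical: 8 codes, 4x4 patches
flow = rng.normal(size=(16, 16, 2))          # stand-in for a (dx, dy) flow field
tokens = quantize_flow(flow, codebook)       # 4x4 grid of ids in [0, 8)
```

The resulting grid of integer ids is the kind of sequence a Transformer can predict token by token, which is the motivation the reviewers attribute to quantizing flow rather than regressing it directly.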
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper proposes quantizing flow representations to enhance predictability for Transformer-based architectures. Experiments show that this strategy effectively supports action-free skill learning.
2. The writing is clear and well-organized, with high-quality visualizations.
3. The analysis of skill tokens and progress provides interesting insights.
1. A major weakness is the lack of comparison with the most critical baseline, i.e., direct flow prediction methods (e.g., ATM). Since the primary claimed contribution lies in introducing quantized flow representation, omitting this comparison undermines the soundness of the work.
2. The novelty and contribution are limited, as the approach appears to be a direct integration of prior works (e.g., LAPA quantization, ATM flow, and AVDC flow-to-action frameworks). While validating quantized flow re
The idea is well motivated, and the writing is easy to follow. The authors conduct experiments on three different domains: MetaWorld, LIBERO, and BridgeData.
1. The core components of the method, which leverage optical flow as a mid-level representation and learn discrete skill tokens, are well established in recent literature [1,2]. The paper's primary contribution appears to be the combination of these two ideas into a single framework, but it provides limited new algorithmic insight or theoretical foundation beyond this synthesis.
2. The paper positions only using a third-person view as an advantage over other methods like ATM (using a wrist camer
- The paper is easy to follow, and the figures are clear and intuitive, significantly aiding in the understanding of the proposed architecture.
- The core motivation is well-defined. The choice to use optical flow instead of RGB frames for skill token learning is sensible, as it inherently removes redundant background and static scene information, thereby potentially increasing the information density of the skill tokens.
Limited and Simplified Experimental Setup: The experimental validation is too simplistic and limited in scope.
- The paper only tests on 9 Meta-World tasks (out of a total of 50).
- For the Libero benchmark, only 10 Libero-Goal and 10 Libero-Long tasks are evaluated. Standard practice for Libero typically involves training and testing across all 4 Libero task suites (totaling 40 tasks).
- While BridgeData is mentioned, it is only utilized for visualization analysis and not for policy training
Taxonomy
Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI
