Learning Skills from Action-Free Videos

Hung-Chieh Fang; Kuo-Han Hung; Chu-Rong Chen; Po-Jung Chou; Chun-Kai Yang; Po-Chen Ko; Yu-Chiang Wang; Yueh-Hua Wu; Min-Hung Chen; Shao-Hua Sun

arXiv:2512.20052·cs.AI·December 24, 2025

Open Access·3 Reviews

TL;DR

This paper introduces SOF (Skill Abstraction from Optical Flow), a framework that learns high-level latent skills from action-free videos through an optical-flow representation, so that a robot can plan over skills rather than over single-step actions.

Contribution

It proposes a novel flow-based latent skill space that bridges the gap between video prediction and action translation, facilitating high-level planning from raw videos.

Findings

01 Improves multitask learning performance
02 Enables long-horizon skill composition
03 Learns directly from raw visual data

Abstract

Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over…
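To make the idea of a discrete, flow-based skill token concrete, here is a minimal sketch, not the SOF implementation. It assumes each video chunk has already been reduced to a dense flow field, and it maps the chunk's mean motion to the nearest entry of a small hand-picked codebook. In the paper the skill space is learned; the codebook, `mean_flow`, `quantize`, and `flows_to_tokens` below are all hypothetical names chosen for illustration.

```python
from math import hypot

# Hypothetical codebook (an assumption, not from the paper): each entry is a
# prototype 2-D motion (dx, dy) in pixels per frame.
CODEBOOK = [
    (0.0, 0.0),   # 0: no motion
    (1.0, 0.0),   # 1: right
    (-1.0, 0.0),  # 2: left
    (0.0, 1.0),   # 3: down (image coordinates)
    (0.0, -1.0),  # 4: up
]


def mean_flow(flow_field):
    """Average a dense flow field given as rows of (dx, dy) pairs."""
    vectors = [v for row in flow_field for v in row]
    n = len(vectors)
    return (sum(dx for dx, _ in vectors) / n,
            sum(dy for _, dy in vectors) / n)


def quantize(vec, codebook=CODEBOOK):
    """Return the index of the codebook entry nearest to vec (Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: hypot(vec[0] - codebook[i][0], vec[1] - codebook[i][1]),
    )


def flows_to_tokens(flow_fields):
    """Map a sequence of flow fields to a sequence of discrete token ids."""
    return [quantize(mean_flow(field)) for field in flow_fields]


# Two toy 2x2 flow fields: one drifting right, one drifting up.
rightward = [[(0.9, 0.1), (0.9, 0.1)], [(0.9, 0.1), (0.9, 0.1)]]
upward = [[(0.0, -1.1), (0.0, -1.1)], [(0.0, -1.1), (0.0, -1.1)]]
tokens = flows_to_tokens([rightward, upward])
print(tokens)  # [1, 4]
```

A planner that works on token sequences like `[1, 4]` reasons over a small discrete vocabulary instead of raw pixels, which is the general motivation for quantizing motion. How SOF actually encodes flow, learns its codebook, and decodes tokens into robot actions is not specified in this excerpt.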

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 5

Strengths

1. The paper proposes quantizing flow representations to enhance predictability for Transformer-based architectures. Experiments show that this strategy effectively supports action-free skill learning.
2. The writing is clear and well-organized, with high-quality visualizations.
3. The analysis of skill tokens and progress provides interesting insights.

Weaknesses

1. A major weakness is the lack of comparison with the most critical baseline, i.e., direct flow prediction methods (e.g., ATM). Since the primary claimed contribution lies in introducing quantized flow representation, omitting this comparison undermines the soundness of the work.
2. The novelty and contribution are limited, as the approach appears to be a direct integration of prior works (e.g., LAPA quantization, ATM flow, and AVDC flow-to-action frameworks). While validating quantized flow re

Reviewer 02·Rating 2·Confidence 5

Strengths

The idea is well motivated, and the writing is easy to follow. The authors conduct experiments on three different domains: MetaWorld, LIBERO, and BridgeData.

Weaknesses

1. The core components of the method, which leverage optical flow as a mid-level representation and learn discrete skill tokens, are well established in recent literature [1,2]. The paper's primary contribution appears to be the combination of these two ideas into a single framework, but it provides limited new algorithmic insight or theoretical foundation beyond this synthesis.
2. The paper positions its use of only a third-person view as an advantage over other methods like ATM (using a wrist camer

Reviewer 03·Rating 4·Confidence 5

Strengths

- The paper is easy to follow, and the figures are clear and intuitive, significantly aiding in the understanding of the proposed architecture.
- The core motivation is well-defined. The choice to use optical flow instead of RGB frames for skill token learning is sensible, as it inherently removes redundant background and static scene information, thereby potentially increasing the information density of the skill tokens.

Weaknesses

Limited and Simplified Experimental Setup: The experimental validation is too simplistic and limited in scope.
- The paper only tests on 9 Meta-World tasks (out of a total of 50).
- For the Libero benchmark, only 10 Libero-Goal and 10 Libero-Long tasks are evaluated. Standard practice for Libero typically involves training and testing across all 4 Libero task suites (totaling 40 tasks).
- While BridgeData is mentioned, it is only utilized for visualization analysis and not for policy training

Taxonomy

Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI