LAOF: Robust Latent Action Learning with Optical Flow Constraints

Xizhou Bu; Jiexi Lyu; Fulei Sun; Ruichen Yang; Zhiqiang Ma; Wei Li

arXiv:2511.16407 · cs.RO · March 24, 2026

TL;DR

LAOF introduces a pseudo-supervised framework utilizing optical flow to learn robust latent action representations from videos, effectively handling distractors and reducing reliance on labeled data, thus enhancing downstream imitation and reinforcement learning.

Contribution

The paper presents LAOF, a novel optical flow-based pseudo-supervised method for latent action learning that is robust to distractors and effective with minimal or no action labels.

Findings

01. LAOF outperforms existing methods on downstream imitation learning and reinforcement learning tasks.

02. Optical flow constraints stabilize training and improve representation quality.

03. LAOF matches or surpasses fully supervised methods with only 1% of the action labels.

Abstract

Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and…
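
To make the idea of optical flow as a pseudo-supervision signal concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: the abstract does not specify the loss, the decoder, or the flow estimator, so the endpoint-error loss, the synthetic flow field, and the names `flow_constraint_loss`, `make_toy_flow`, and `decode_flow` are all assumptions made for illustration. It shows only that a flow field is zero on a static background and that a flow-matching loss penalizes a latent action whose decoded motion disagrees with the observed motion.

```python
import numpy as np


def flow_constraint_loss(pred_flow, target_flow):
    """Mean endpoint error between two flow fields of shape (H, W, 2).

    Assumed loss for illustration; the source does not state which loss is used.
    """
    return float(np.mean(np.linalg.norm(pred_flow - target_flow, axis=-1)))


def make_toy_flow(height=8, width=8, box=(2, 5, 2, 5), shift=(1.0, 0.0)):
    """Synthetic flow: static background, one box moving by `shift` = (dx, dy)."""
    flow = np.zeros((height, width, 2), dtype=np.float32)
    top, bottom, left, right = box
    flow[top:bottom, left:right] = shift
    return flow


# Pseudo-label: flow between two consecutive frames (synthetic here).
target = make_toy_flow()

# Flow magnitude is zero wherever nothing moves, so static background
# pixels contribute no motion signal.
magnitude = np.linalg.norm(target, axis=-1)
moving_fraction = float((magnitude > 0).mean())

# Hypothetical decoder: paints a 2-D latent action onto the moving region.
mask = (magnitude > 0).astype(np.float32)[..., None]


def decode_flow(latent_action):
    """Broadcast a latent action of shape (2,) over the masked pixels."""
    return mask * np.asarray(latent_action, dtype=np.float32)


loss_good = flow_constraint_loss(decode_flow([1.0, 0.0]), target)  # matches motion
loss_bad = flow_constraint_loss(decode_flow([0.0, 1.0]), target)   # wrong direction

print(f"moving pixels: {moving_fraction:.3f}")
print(f"loss (matching latent): {loss_good:.4f}")
print(f"loss (mismatched latent): {loss_bad:.4f}")
```

In a real pipeline the target flow would come from an optical flow estimator and the decoder would be a learned network; both are replaced by toy stand-ins here.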

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications