LAOF: Robust Latent Action Learning with Optical Flow Constraints
Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, Wei Li

TL;DR
LAOF is a pseudo-supervised framework that uses optical flow to learn latent action representations from videos. The learned representations are robust to distractors and depend less on action labels, which improves downstream imitation learning and reinforcement learning.
Contribution
The paper presents LAOF, a novel optical flow-based pseudo-supervised method for latent action learning that is robust to distractors and effective with minimal or no action labels.
Findings
LAOF outperforms existing methods on downstream imitation learning and reinforcement learning tasks.
Optical flow constraints stabilize training and improve representation quality.
LAOF matches or surpasses fully supervised methods with only 1% action labels.
Abstract
Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and…
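The abstract's observation that optical flow suppresses static background and emphasizes moving objects can be illustrated with a toy example. The sketch below is not the paper's flow estimator or training pipeline: it swaps in exhaustive block matching as a simple stand-in for optical flow, and the function name `block_matching_flow`, the frame sizes, and the synthetic textures are all assumptions made for illustration. A textured but static background plays the role of an action-irrelevant distractor, and a single textured square moves by a known offset.

```python
import numpy as np

def block_matching_flow(frame1, frame2, block=8, radius=4):
    """Estimate one (dy, dx) displacement per block by exhaustive SSD search.

    Hypothetical helper for illustration only; a crude stand-in for a
    real optical flow estimator.
    """
    h, w = frame1.shape
    flow = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            patch = frame1[y:y + block, x:x + block]
            best_ssd, best = np.inf, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    # Skip candidate windows that fall outside the frame.
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    cand = frame2[yy:yy + block, xx:xx + block]
                    ssd = float(np.sum((patch - cand) ** 2))
                    if ssd < best_ssd:
                        best_ssd, best = ssd, (dy, dx)
            flow[by, bx] = best
    return flow

# Synthetic scene: busy static background plus one moving textured square.
rng = np.random.default_rng(0)
background = rng.random((32, 32))
obj = rng.random((8, 8))

frame1 = background.copy()
frame1[8:16, 8:16] = obj      # object at its starting position
frame2 = background.copy()
frame2[10:18, 11:19] = obj    # object shifted by (dy, dx) = (2, 3)

flow = block_matching_flow(frame1, frame2)
magnitude = np.linalg.norm(flow, axis=-1)

print("object block flow:", tuple(flow[1, 1]))       # block covering the square
print("background block flow:", tuple(flow[0, 0]))   # far from the square
```

In this toy setting the block covering the square recovers its displacement of (2, 3), while background blocks away from the square have zero flow even though their texture is visually busy. That is the sense in which a flow signal can ignore appearance-level distractors; how LAOF actually turns the agent's flow into a training constraint is described in the paper, not here.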
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
