What Do Latent Action Models Actually Learn?

Chuheng Zhang; Tim Pearce; Pushi Zhang; Kaixin Wang; Xiaoyu Chen; Wei Shen; Li Zhao; Jiang Bian

arXiv:2506.15691 · cs.LG · November 13, 2025

TL;DR

This paper investigates whether latent action models (LAMs) truly learn action-related changes in videos or are confounded by exogenous noise, analyzing a tractable linear model of LAMs to derive concrete insights for improving how they learn.

Contribution

The paper introduces a linear model to analytically study LAMs, revealing their connection to PCA and proposing strategies to enhance learning of controllable changes.
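
To make the PCA connection concrete, here is a short derivation under simplifying assumptions of our own: zero-mean observations o (current frame) and o' (next frame), second-moment matrices such as \Sigma_{o'o} = \mathbb{E}[o' o^\top] with \Sigma_{oo} invertible, a linear encoder and decoder, a k-dimensional latent, and squared-error loss. The notation (A, B, C, D, P, M, r) is hypothetical and need not match the paper's model.

```latex
\begin{aligned}
&\text{encoder (latent): } z = C\,o + D\,o', \qquad \text{decoder: } \hat{o}' = A\,o + B\,z, \qquad z \in \mathbb{R}^{k},\\
&\text{split } o' = P\,o + r, \quad P = \Sigma_{o'o}\,\Sigma_{oo}^{-1} \;\Rightarrow\; \mathbb{E}\big[r\,o^{\top}\big] = 0,\\
&\hat{o}' = M\,o + BD\,r, \qquad M := A + BC + BDP,\\
&\mathcal{L} = \mathbb{E}\,\lVert o' - \hat{o}' \rVert^{2}
  = \mathbb{E}\,\lVert (P - M)\,o \rVert^{2} + \mathbb{E}\,\lVert (I - BD)\,r \rVert^{2}.
\end{aligned}
```

The first term is zero when M = P, which can always be reached by choosing A. The second term is the objective of a linear autoencoder with a rank-k bottleneck applied to r, so it is minimized when BD is the orthogonal projector onto the top-k eigenvectors of \Sigma_{rr} = \mathbb{E}[r r^\top] (the classic linear-autoencoder/PCA equivalence). In this toy model, the information the latent carries about o' beyond o is the projection of r onto its k highest-variance directions, whether that variance comes from actions or from exogenous noise.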

Findings

01. LAMs can be influenced by exogenous noise, not just by actions.

02. Data augmentation and data cleaning can help LAMs focus on controllable changes.

03. Numerical simulations illustrate how the structure of the data affects what LAMs learn.
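
A small numerical sketch of the first two findings, under assumptions of our own rather than the paper's setup: frames and frame changes are linear, the LAM is linear with a k-dimensional latent trained with squared error, and under those assumptions the part of its latent that depends on the next frame spans the top-k principal components of whatever in the next frame cannot be linearly predicted from the current one. In the hypothetical toy world below, actions move k coordinates with unit variance and exogenous noise moves the remaining coordinates with standard deviation noise_std. The helper names (residual_pca_basis, alignment, simulate) and all parameter values are illustrative only.

```python
import numpy as np

def residual_pca_basis(o, o_next, k):
    """Top-k principal directions of o_next after removing its least-squares
    linear prediction from o (our stand-in for an optimal linear LAM latent)."""
    P, *_ = np.linalg.lstsq(o, o_next, rcond=None)  # o_next ≈ o @ P
    r = o_next - o @ P                              # unpredictable part of the change
    _, eigvecs = np.linalg.eigh(r.T @ r / len(r))   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]                  # columns = top-k directions

def alignment(basis, target):
    """Mean squared cosine of the principal angles between two subspaces (1 = same, 0 = orthogonal)."""
    q_b, _ = np.linalg.qr(basis)
    q_t, _ = np.linalg.qr(target)
    cosines = np.linalg.svd(q_t.T @ q_b, compute_uv=False)
    return float(np.mean(cosines ** 2))

def simulate(noise_std, n=20_000, d=6, k=2, seed=0):
    """Hypothetical toy world: k action coordinates plus d - k noise coordinates."""
    rng = np.random.default_rng(seed)
    o = rng.normal(size=(n, d))
    U = np.eye(d)[:, :k]                  # actions shift the first k coordinates
    N = np.eye(d)[:, k:]                  # exogenous noise shifts the rest
    a = rng.normal(size=(n, k))           # random, state-independent actions
    eps = noise_std * rng.normal(size=(n, d - k))
    o_next = o + a @ U.T + eps @ N.T
    return alignment(residual_pca_basis(o, o_next, k), U)

for std in (0.3, 3.0):  # low noise (e.g. after cleaning) vs. high noise
    print(f"noise_std={std}: alignment with action subspace = {simulate(std):.3f}")
```

Running it should report an alignment near 1 for noise_std=0.3 and near 0 for noise_std=3.0: once the noise variance exceeds the action variance, the latent tracks noise instead of actions, and reducing the noise (which is what data cleaning aims for in this toy) restores the action subspace.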

Abstract

Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing the changes between frames into latents. However, differences between video frames can be caused by controllable changes as well as by exogenous noise, which raises an important concern: do the latents capture the changes caused by actions, or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning while remaining tractable. This provides several insights, including connections between LAMs and principal component analysis (PCA), desiderata of the data-generating policy, and justification of strategies that encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of…
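
Auxiliary action prediction can be sketched in the same kind of linear toy setting, again under assumptions of our own: the latent z is a linear function of the residual frame change r (the part of the next frame not linearly predictable from the current one, simulated directly here), and besides reconstructing r it must also linearly predict the true action a, weighted by lam. For simplicity every sample is labeled; in practice only a small labeled subset would typically be available. With linear maps this is a reduced-rank regression, solved exactly via an SVD of the unconstrained fit. The names (encoder_directions, alignment, lam) and the toy data are hypothetical, and the paper's actual objective may differ.

```python
import numpy as np

def encoder_directions(r, a, k, lam):
    """Columns span the directions in r-space read by a rank-k linear latent that
    minimizes E||r - r_hat||^2 + lam * E||a - a_hat||^2 (reduced-rank regression)."""
    y = np.hstack([r, np.sqrt(lam) * a])            # stacked, weighted targets
    coef, *_ = np.linalg.lstsq(r, y, rcond=None)    # unconstrained fit: y ≈ r @ coef
    _, _, vt = np.linalg.svd(r @ coef, full_matrices=False)
    return coef @ vt[:k].T                          # encoder: z = r @ (coef @ V_k)

def alignment(basis, target):
    """Mean squared cosine of the principal angles between two subspaces."""
    q_b, _ = np.linalg.qr(basis)
    q_t, _ = np.linalg.qr(target)
    return float(np.mean(np.linalg.svd(q_t.T @ q_b, compute_uv=False) ** 2))

rng = np.random.default_rng(0)
n, d, k, noise_std = 20_000, 6, 2, 3.0
U = np.eye(d)[:, :k]                                 # action directions (hypothetical)
a = rng.normal(size=(n, k))                          # action labels, unit variance
eps = noise_std * rng.normal(size=(n, d - k))        # strong exogenous noise
r = a @ U.T + np.hstack([np.zeros((n, k)), eps])     # residual frame change

for lam in (0.0, 20.0):
    print(f"lam={lam}: alignment = {alignment(encoder_directions(r, a, k, lam), U):.3f}")
```

With lam=0 this reduces to PCA on r, and because the noise variance (9) exceeds the action variance (1), the latent should align with the noise. With lam=20 the action directions effectively carry variance 1 + 20 in the stacked targets, outranking the noise, so the alignment should move close to 1.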

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)