Why Latent Actions Fail, and How to Prevent It

Jung Min Lee; Taehyun Cho; Li Zhao; Jungwoo Lee

arXiv:2605.20223 · cs.CV · May 21, 2026

TL;DR

This paper analyzes how exogenous state in videos impairs latent action models and proposes methods to improve the learning of action representations by focusing on endogenous components.

Contribution

It extends a linear LAM framework to explicitly model exogenous state and provides a theoretical analysis of how to mitigate its interference.

Findings

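01. Minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observations.

02. Learning in a representation space that focuses on endogenous components mitigates this interference and improves latent action learning.

03. Auxiliary objectives such as action supervision promote latent actions that are consistent across exogenous states.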
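
The first two findings can be illustrated with a toy simulation. This is a minimal sketch and not the paper's model. It assumes that the latent action is a linear, rank-limited function of the frame difference, so the reconstruction-optimal latent is the top principal component of that difference. The dimensions, the `exo_scale` value, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n=5000, exo_scale=3.0, seed=0):
    """Toy frame differences: 2 endogenous dims driven by a 1-D action,
    plus 2 exogenous dims that change independently of the action."""
    rng = np.random.default_rng(seed)
    actions = rng.normal(size=(n, 1))
    # Assumption: the action moves the endogenous state along a fixed direction.
    endo_delta = actions @ np.array([[1.0, 0.5]])
    # Assumption: exogenous change is independent noise with larger variance.
    exo_delta = exo_scale * rng.normal(size=(n, 2))
    return actions, np.hstack([endo_delta, exo_delta])


def linear_latent_action(deltas, dim=1):
    """Reconstruction-optimal linear bottleneck: project onto the top
    principal directions of the centered frame differences."""
    centered = deltas - deltas.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


def abs_corr(x, y):
    """Absolute Pearson correlation between two 1-D signals."""
    return abs(np.corrcoef(x.ravel(), y.ravel())[0, 1])


actions, deltas = simulate()

# Latent learned from the full observation: dominated by exogenous change.
z_full = linear_latent_action(deltas)
# Latent learned from the endogenous coordinates only.
z_endo = linear_latent_action(deltas[:, :2])

corr_full = abs_corr(z_full, actions)
corr_endo = abs_corr(z_endo, actions)
print("correlation with true action (full observation):", corr_full)
print("correlation with true action (endogenous only): ", corr_endo)
```

In this toy setup the exogenous change has more variance than the action-driven change, so a one-dimensional latent fitted to the full observation tracks the exogenous noise, while the same latent fitted to the endogenous coordinates tracks the action. Which coordinates are endogenous is given by construction here. In real video that separation has to be learned.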

Abstract

Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but also exogenous state such as background clutter. Because the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates the problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observations; and (2) learning in a representation space that focuses on endogenous components is key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action-supervision, provably…
