Why Latent Actions Fail, and How to Prevent It
Jung Min Lee, Taehyun Cho, Li Zhao, Jungwoo Lee

TL;DR
This paper analyzes how exogenous state in videos impairs latent action models and proposes methods to improve the learning of action representations by focusing on endogenous components.
Contribution
It extends a linear LAM framework to explicitly model exogenous state and provides a theoretical analysis of how to mitigate its interference.
Findings
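Minimizing the standard reconstruction objective yields latent actions that encode exogenous information from future observations.
Learning in a representation space that focuses on endogenous components mitigates the interference of exogenous noise.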
Auxiliary objectives like action-supervision promote consistency across exogenous states.
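
The first two findings can be illustrated with a toy sketch. This is not the paper's model: it assumes a two-dimensional frame difference with one endogenous coordinate (the action effect) and one exogenous coordinate (higher-variance noise), and it uses the leading principal direction as a stand-in for a one-dimensional linear latent trained on reconstruction. All names (`top_direction`, `abs_corr`, the variances, the sample size) are hypothetical choices for illustration.

```python
import numpy as np


def top_direction(diffs):
    """Leading principal direction of the (centered) frame differences."""
    centered = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def abs_corr(x, y):
    """Absolute Pearson correlation (the sign of a latent is arbitrary)."""
    return abs(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(0)
n = 5000

# Assumed toy data: the action moves the endogenous coordinate,
# while the exogenous coordinate changes independently with larger variance.
action = rng.normal(0.0, 1.0, size=n)
exo = rng.normal(0.0, 3.0, size=n)
diffs = np.stack([action, exo], axis=1)

# 1-D latent fit on raw frame differences (reconstruction-style objective).
z_raw = diffs @ top_direction(diffs)

# Same latent fit after keeping only the endogenous coordinate.
endo_only = diffs * np.array([1.0, 0.0])
z_endo = endo_only @ top_direction(endo_only)

corr_raw_action = abs_corr(z_raw, action)
corr_raw_exo = abs_corr(z_raw, exo)
corr_endo_action = abs_corr(z_endo, action)

print(f"raw latent vs action:        {corr_raw_action:.3f}")
print(f"raw latent vs exogenous:     {corr_raw_exo:.3f}")
print(f"endogenous latent vs action: {corr_endo_action:.3f}")
```

Under these assumptions, the latent fit on raw differences tracks the exogenous coordinate because it explains more variance, while the latent fit in the endogenous-only space tracks the action. How to obtain such an endogenous representation in practice is the subject of the paper and is not addressed by this sketch.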
Abstract
Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but also exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observations; and (2) learning in a representation space that focuses on endogenous components is key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action-supervision, provably…
