Hierarchical Latent Action Model

Hanjung Kim; Lerrel Pinto; Seon Joo Kim

arXiv:2603.05815 · cs.RO · March 9, 2026

Open Access

TL;DR

HiLAM is a hierarchical latent action model that captures long-term temporal structure in actionless videos by aggregating low-level latent actions into high-level skills, improving dynamic skill discovery over the baseline.

Contribution

The paper introduces a hierarchical architecture that uses a pretrained LAM as a low-level extractor to model long-term dependencies and discover high-level skills in actionless videos.

Findings

01 Improves over the baseline in dynamic skill discovery

02 Robustly captures long-term temporal dependencies

03 Effectively aggregates latent action sequences into high-level skills

Abstract

Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short-horizon frame transitions and capture low-level motion while overlooking longer-term temporal structure. In contrast, actionless videos often contain temporally extended and high-level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low-level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high-level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.
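The two-level structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the extractor, the aggregator, or the window length, so everything below is an assumption made for illustration. The hypothetical `low_level_latent_action` stands in for a frozen pretrained LAM (here it is just a frame difference), and the hypothetical `aggregate_to_skills` stands in for a learned aggregator (here it is mean pooling over fixed, non-overlapping windows).

```python
from typing import List, Sequence

Vector = List[float]


def low_level_latent_action(frame_t: Sequence[float],
                            frame_next: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a frozen, pretrained LAM.

    A real LAM would encode a frame pair into a learned latent action;
    this placeholder simply returns the element-wise frame difference.
    """
    return [b - a for a, b in zip(frame_t, frame_next)]


def extract_latent_actions(frames: Sequence[Sequence[float]]) -> List[Vector]:
    """Apply the low-level extractor to every consecutive frame pair."""
    return [
        low_level_latent_action(frames[i], frames[i + 1])
        for i in range(len(frames) - 1)
    ]


def aggregate_to_skills(latent_actions: Sequence[Vector],
                        window: int) -> List[Vector]:
    """Assumed aggregator: mean-pool non-overlapping windows of latent actions.

    Each pooled vector plays the role of one high-level latent skill.
    A learned sequence model would replace this in practice.
    """
    skills: List[Vector] = []
    for start in range(0, len(latent_actions), window):
        chunk = latent_actions[start:start + window]
        dim = len(chunk[0])
        skills.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return skills


# Toy "video": a 2-D point that moves right twice, then up twice.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]]
actions = extract_latent_actions(frames)
skills = aggregate_to_skills(actions, window=2)
print(skills)  # [[1.0, 0.0], [0.0, 1.0]]
```

In this toy example the four short-horizon latent actions collapse into two skill vectors, one for "move right" and one for "move up", which is the kind of temporally extended structure that a frame-pair model alone does not expose.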

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis