Hierarchical Latent Action Model
Hanjung Kim, Lerrel Pinto, Seon Joo Kim

TL;DR
HiLAM is a hierarchical latent action model that captures long-term temporal structure in actionless videos by discovering high-level skills, improving dynamic skill discovery over the baseline.
Contribution
It introduces a hierarchical architecture that leverages pretrained LAMs to model long-term dependencies and discover high-level skills in actionless videos.
Findings
Improves over baseline in dynamic skill discovery
Robustly captures long-term temporal dependencies
Effectively aggregates latent sequences into high-level skills
Abstract
Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short-horizon frame transitions and capture low-level motion while overlooking longer-term temporal structure. In contrast, actionless videos often contain temporally extended and high-level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low-level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high-level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.
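The two-level structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the low-level LAM, the aggregation module, or the skill representation. Everything below is an assumption made for illustration. The hypothetical `low_level_latent_actions` stands in for a frozen pretrained LAM using a fixed linear projection of frame differences, and the hypothetical `aggregate_to_skills` uses mean pooling over fixed windows followed by nearest-codebook assignment in place of whatever learned aggregator HiLAM uses.

```python
import numpy as np


def low_level_latent_actions(frames, proj):
    """Stand-in for a frozen pretrained LAM (assumption, not the paper's model).

    Produces one latent action per consecutive frame transition by
    projecting the frame difference with a fixed matrix.
    """
    diffs = frames[1:] - frames[:-1]   # (T-1, D): short-horizon transitions
    return diffs @ proj                # (T-1, d_latent)


def aggregate_to_skills(latents, window, codebook):
    """Toy high-level aggregator (assumption, not the paper's module).

    Mean-pools each non-overlapping window of latent actions, then
    assigns the pooled vector to its nearest codebook entry.
    """
    n_windows = latents.shape[0] // window
    trimmed = latents[: n_windows * window]            # drop any incomplete tail window
    pooled = trimmed.reshape(n_windows, window, -1).mean(axis=1)  # (n_windows, d_latent)
    # Squared Euclidean distance from every pooled vector to every code.
    dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)                        # (n_windows,) integer skill ids


rng = np.random.default_rng(0)
frames = rng.normal(size=(33, 64))    # 33 flattened "frames", 64 features each
proj = rng.normal(size=(64, 8))       # frozen low-level projection
codebook = rng.normal(size=(4, 8))    # 4 candidate high-level skills

latents = low_level_latent_actions(frames, proj)                    # shape (32, 8)
skills = aggregate_to_skills(latents, window=8, codebook=codebook)  # shape (4,)
```

Here 33 frames yield 32 low-level latent actions, and a window of 8 groups them into 4 segments, each labeled with one skill index. The point of the sketch is only the data flow: short-horizon latent actions come from a fixed extractor, and the high-level module operates on sequences of those latents rather than on raw frames.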
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
