Vision-Language Models Unlock Task-Centric Latent Actions
Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Lyubaykin Nikita, Vladislav Kurenkov

TL;DR
This paper introduces a method leveraging Vision-Language Models' common-sense reasoning to improve Latent Action Models by filtering out distractors, significantly enhancing task success rates in complex video environments.
Contribution
It proposes a novel approach using promptable VLM representations to distinguish meaningful actions from noise in LAM training, improving robustness and performance.
Findings
VLM promptability varies significantly across models.
Ignoring distractors improves latent action quality.
Up to six-fold increase in success rates on Distracting MetaWorld.
Abstract
Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts…
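To make the idea of using promptable representations "as targets during LAM training" concrete, here is a minimal sketch, not the authors' implementation. It assumes a LAPO-style setup (an inverse-dynamics model infers a latent action from a transition, and a forward-dynamics model predicts the next-step target from the current observation and that latent action) with toy linear maps. The function name `lam_loss`, all shapes, and the random arrays standing in for VLM embeddings are hypothetical.

```python
import numpy as np


def lam_loss(obs_t, obs_next, target_next, W_idm, W_fdm):
    """Forward-prediction loss of a toy linear latent action model.

    obs_t, obs_next: (batch, obs_dim) observation features.
    target_next: (batch, target_dim) prediction targets; a stand-in for
        promptable VLM embeddings of the next observation (assumption).
    W_idm: (2 * obs_dim, latent_dim) inverse-dynamics weights.
    W_fdm: (obs_dim + latent_dim, target_dim) forward-dynamics weights.
    """
    # Inverse dynamics: infer a latent action from the (o_t, o_{t+1}) pair.
    z = np.concatenate([obs_t, obs_next], axis=1) @ W_idm
    # Forward dynamics: predict the target from o_t and the latent action.
    pred = np.concatenate([obs_t, z], axis=1) @ W_fdm
    # Mean squared error against the target representation.
    loss = float(np.mean((pred - target_next) ** 2))
    return loss, z


rng = np.random.default_rng(0)
batch, obs_dim, latent_dim, target_dim = 8, 16, 4, 12

obs_t = rng.normal(size=(batch, obs_dim))
obs_next = rng.normal(size=(batch, obs_dim))
# Placeholder for embeddings a VLM would produce for obs_next given a task prompt.
vlm_targets = rng.normal(size=(batch, target_dim))

W_idm = 0.1 * rng.normal(size=(2 * obs_dim, latent_dim))
W_fdm = 0.1 * rng.normal(size=(obs_dim + latent_dim, target_dim))

loss, z = lam_loss(obs_t, obs_next, vlm_targets, W_idm, W_fdm)
print(z.shape, loss >= 0.0)
```

The only change relative to a pixel-reconstruction objective in this sketch is the choice of `target_next`: if the target representation ignores distractors, the latent action has less incentive to encode them.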
Peer Reviews
Decision·Submitted to ICLR 2026
+ Well-Motivated Solution: Directly addresses a known failure mode of LAMs.
+ The "twin observation" experiment is a powerful motivator, and the large-scale VLM benchmark provides strong, data-driven insights.
+ Experimental results demonstrate a substantial improvement in success rates on the Distracting MetaWorld benchmark.
- The core concept of using VLM embeddings as representations for control was previously explored by Chen et al. [9] and Huang et al. [24]. This work applies a similar idea to a different, albeit important, problem (LAM robustness).
- The benchmarking is almost exclusively against the baseline LAPO [41] and its own variants. It lacks comparison to other contemporary LAMs (e.g., UniVLA [8]) or alternative distractor-handling techniques, which limits any claim to broader state-of-the-art performance.
- The an
- This work tackles an important problem of disambiguating actions in LAMs. The robotics setups usually considered are often too simplistic for it to matter, but it is crucial in more complex environments.
- The authors present a clear experiment to demonstrate the targeted problem.
- The experiments across VLMs are exhaustive, both across models and prompt choices.
- The model recovers most of the performance of the distractor-free setting.
1) Saying that the method is “without any supervision” (line 57) is slightly misleading, as it relies both on task information and on a VLM to extract relevant information. It is, however, weaker supervision than in previous works that use action labels.
2) As pointed out by the authors, the linear probing evaluation does not guarantee the minimality of actions, and one could imagine that both the robot action and the “noise” are captured. While the proposed bottleneck to 128 dimensions
Leveraging the common-sense reasoning capabilities of VLMs to learn stronger latent actions centered on controllable changes is well motivated and addresses an important challenge facing current LAMs. The paper conducts an extensive empirical study to validate an optimal strategy for extracting promptable representations from VLMs. This study is conducted across a wide range of recent SOTA VLMs. The promptable representations are effective and consistently improve performance over baseline LAMs i
Overall, the novelty seems limited to the reviewer. As referenced by the authors, Chen et al. [1] proposed promptable representations, and much of the evaluation setting follows Nikulin et al. [2]. While the motivation is reasonable, the underlying reason why promptable representations lead to such improvements remains unexplored and poorly understood to the reviewer.
* It is unclear why promptable representations from VLMs should be preferred to other methods like UniVLA that aim to disentangle
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)
