LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

Dujun Nie; Fengjiao Chen; Qi Lv; Jun Kuang; Xiaoyu Li; Xuezhi Cao; Xunliang Cai

arXiv:2604.11689 · cs.CV · April 14, 2026

TL;DR

LARY is a benchmark for evaluating latent action representations derived from large-scale human videos. Its results indicate that general visual foundation models outperform specialized embodied models on vision-to-action tasks.

Contribution

The paper presents the LARY benchmark, a dataset and evaluation framework for assessing latent action representations on both high-level semantic action tasks and low-level robotic control tasks. The evaluation indicates that general visual foundation models outperform specialized embodied models.

Findings

01. General visual foundation models outperform specialized embodied models.

02. Latent-based visual space aligns better with physical action space than pixel-based space.

03. Semantic abstraction from vision is more effective for control than pixel-level reconstruction.

Abstract

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and…

Code & Models

Repositories

meituan-longcat/LARYBench (GitHub)

Models

Datasets

meituan-longcat/LARYBench (dataset, 13k downloads)
