LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

Jiangran Lyu; Kai Liu; Xuheng Zhang; Haoran Liao; Yusen Feng; Wenxuan Zhu; Tingrui Shen; Jiayi Chen; Jiazhao Zhang; Yifei Dong; Wenbo Cui; Senmao Qi; Shuo Wang; Yixin Zheng; Mi Yan; Xuesong Shi; Haoran Li; Dongbin Zhao; Ming-Yu Liu; Zhizheng Zhang; Li Yi; Yizhou Wang; He Wang

arXiv:2602.12215·cs.RO·February 13, 2026

Open Access

TL;DR

LDA-1B is a large-scale robot foundation model that ingests diverse embodied data to learn dynamics, policy, and visual forecasting, enabling improved performance and data efficiency in robotic tasks.

Contribution

The paper introduces LDA-1B, a scalable robot foundation model that leverages heterogeneous embodied data through structured latent space prediction and multi-modal transformers.

Findings

01. Outperforms prior methods by up to 21%, 48%, and 23% on various tasks.

02. Assembles and standardizes EI-30k, an embodied interaction dataset of over 30k hours of human and robot trajectories in a unified format.

03. Enables data-efficient fine-tuning, improving performance with low-quality data.

Abstract

Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to foundation-level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Social Robot Interaction and HRI