UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

Manish Kumar Govind; Dominick Reilly; Pu Wang; Srijan Das

arXiv:2602.20231 · cs.RO · April 10, 2026

TL;DR

UniLACT introduces a depth-aware transformer-based model for vision-language-action tasks, enhancing latent action representations with 3D geometric structure for improved manipulation performance.

Contribution

The paper proposes UniLARN, a unified latent action learning framework that embeds RGB and depth in a shared space, and UniLACT, a transformer-based vision-language-action model that inherits geometric priors through depth-aware latent pretraining.

Findings

01 Outperforms RGB-only latent action models in simulation and real-world tasks.

02 Effectively incorporates depth information to improve spatial understanding.

03 Demonstrates robustness across in-domain and out-of-domain scenarios.

Abstract

Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This…
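The inverse and forward dynamics objectives mentioned in the abstract can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the class name `ToyLatentActionModel`, the dimensions, and the use of untrained linear maps in place of the paper's transformer. It only shows how the pieces fit together: a shared encoder for RGB and depth, an inverse dynamics map that infers a latent action from two consecutive observations of both modalities, and a forward dynamics map that predicts the next embedding from the current one plus that latent action.

```python
import random


def init(rows, cols, rng):
    """Create a small random weight matrix as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a dense layer (no bias) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ToyLatentActionModel:
    """Hypothetical toy model, not the paper's architecture."""

    def __init__(self, obs_dim=6, embed_dim=4, latent_dim=2, seed=0):
        rng = random.Random(seed)
        # One encoder shared by both modalities (assumed shared embedding space).
        self.enc = init(embed_dim, obs_dim, rng)
        # Inverse dynamics: (rgb_t, depth_t, rgb_next, depth_next) -> latent action.
        self.idm = init(latent_dim, 4 * embed_dim, rng)
        # Forward dynamics: (rgb_t, depth_t, latent action) -> next embeddings.
        self.fdm = init(2 * embed_dim, 2 * embed_dim + latent_dim, rng)

    def encode(self, rgb, depth):
        # Concatenate the two modality embeddings into one joint vector.
        return linear(self.enc, rgb) + linear(self.enc, depth)

    def step_loss(self, rgb_t, depth_t, rgb_next, depth_next):
        cur = self.encode(rgb_t, depth_t)
        nxt = self.encode(rgb_next, depth_next)
        z = linear(self.idm, cur + nxt)   # inferred latent action
        pred = linear(self.fdm, cur + z)  # predicted next embeddings
        return z, mse(pred, nxt)


# Random vectors stand in for per-frame RGB and depth features.
rng = random.Random(1)
frames = [[rng.random() for _ in range(6)] for _ in range(4)]
model = ToyLatentActionModel()
z, loss = model.step_loss(*frames)
```

In this toy the inverse dynamics map sees both modalities jointly, which is one simple way to let the latent action depend on RGB and depth together. The weights are never trained here; a real setup would minimize the forward prediction loss over many frame pairs from video.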

Code & Models

Repositories

https://manishgovind.github.io/unilact-vla (github)

Models

mgovind7/UniLACT (Hugging Face model)
