TL;DR
UniLACT is a depth-aware, transformer-based vision-language-action (VLA) model that enriches latent action representations with 3D geometric structure to improve manipulation performance.
Contribution
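The paper proposes UniLACT, a transformer-based VLA model that gains geometric priors through depth-aware latent pretraining, and UniLARN, a unified latent action learning framework that uses inverse and forward dynamics objectives to learn a shared embedding space for RGB and depth.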
Findings
Outperforms RGB-only latent action models in simulation and real-world tasks.
Effectively incorporates depth information to improve spatial understanding.
Demonstrates robustness across in-domain and out-of-domain scenarios.
Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This…
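The inverse and forward dynamics objectives mentioned in the abstract can be illustrated with a toy sketch. This is not the UniLARN architecture: the class `ToyLatentActionModel`, its linear layers, the dimensions, and the choice to fuse RGB and depth by summing their projections are all assumptions made for illustration. It only shows the general pattern of inferring a latent action from two consecutive observations and then using that action to predict the next embedding.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ToyLatentActionModel:
    """Hypothetical toy model, NOT the method from the paper."""

    def __init__(self, rgb_dim, depth_dim, embed_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.enc_rgb = random_matrix(embed_dim, rgb_dim, rng)
        self.enc_depth = random_matrix(embed_dim, depth_dim, rng)
        # Inverse dynamics: (z_t, z_next) -> latent action.
        self.inverse = random_matrix(action_dim, 2 * embed_dim, rng)
        # Forward dynamics: (z_t, latent action) -> predicted z_next.
        self.forward = random_matrix(embed_dim, embed_dim + action_dim, rng)

    def embed(self, rgb, depth):
        # Assumption: project both modalities into one space and sum them.
        z_rgb = linear(self.enc_rgb, rgb)
        z_depth = linear(self.enc_depth, depth)
        return [a + b for a, b in zip(z_rgb, z_depth)]

    def latent_action(self, z_t, z_next):
        return linear(self.inverse, z_t + z_next)

    def predict_next(self, z_t, action):
        return linear(self.forward, z_t + action)

    def forward_loss(self, frame_t, frame_next):
        """Forward-dynamics error for one (rgb, depth) frame pair."""
        z_t = self.embed(*frame_t)
        z_next = self.embed(*frame_next)
        action = self.latent_action(z_t, z_next)
        return mse(self.predict_next(z_t, action), z_next)


model = ToyLatentActionModel(rgb_dim=6, depth_dim=4, embed_dim=3, action_dim=2)
frame_t = ([0.1] * 6, [0.5] * 4)      # made-up RGB and depth features
frame_next = ([0.2] * 6, [0.4] * 4)
loss = model.forward_loss(frame_t, frame_next)
```

No action labels appear anywhere in the loss: the latent action is supervised only by how well it explains the transition between observations, which is why this style of objective can be applied to unlabeled video. The sketch omits training; minimizing `loss` over many frame pairs would require a gradient-based optimizer.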
