Pre-training Auto-regressive Robotic Models with 4D Representations

Dantong Niu; Yuvan Sharma; Haoru Xue; Giscard Biamby; Junyi Zhang; Ziteng Ji; Trevor Darrell; Roei Herzig

arXiv:2502.13142·cs.RO·May 20, 2025

TL;DR

This paper introduces ARM4R, a pre-training approach for robotic models using 4D representations derived from human videos, improving transfer learning and performance across robotic tasks.

Contribution

The paper presents a novel auto-regressive robotic model leveraging 4D representations from human videos, enabling effective transfer learning to robotic control tasks.

Findings

01. ARM4R improves transfer efficiency from human videos to robots

02. Enhanced performance across diverse robotic environments

03. Utilizes 3D point tracking and monocular depth estimation

Abstract

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations…
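
The lifting step the abstract mentions (2D point tracks turned into 3D tracks using per-frame monocular depth) can be illustrated with a standard pinhole back-projection. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `lift_track_to_3d`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the idea that a depth value is already sampled at each tracked pixel are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]          # (u, v) pixel coordinates
Point3D = Tuple[float, float, float]   # (x, y, z) in the camera frame


def lift_track_to_3d(
    track_2d: Sequence[Point2D],
    depths: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project one 2D point track into 3D, one point per frame.

    Assumptions (hypothetical, not taken from the paper):
    - track_2d[t] is the pixel location of the tracked point in frame t
    - depths[t] is the estimated depth at that pixel in frame t
    - the camera follows a pinhole model with intrinsics fx, fy, cx, cy
    """
    if len(track_2d) != len(depths):
        raise ValueError("need one depth value per tracked frame")

    track_3d: List[Point3D] = []
    for (u, v), z in zip(track_2d, depths):
        # Pinhole back-projection: scale the normalized ray by the depth.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        track_3d.append((x, y, z))
    return track_3d


# A point that moves 10 pixels to the right at a constant depth of 2.0.
example_track = lift_track_to_3d(
    track_2d=[(320.0, 240.0), (330.0, 240.0)],
    depths=[2.0, 2.0],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
```

Stacking such 3D positions over time for many tracked points gives a point-trajectory sequence, which is one plausible reading of the "4D" (3D plus time) representation described above.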

Videos

Pre-training Auto-regressive Robotic Models with 4D Representations (SlidesLive)

Taxonomy

Topics: Image Processing and 3D Reconstruction

Methods: Focus