Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang

TL;DR
This paper introduces Seer, a scalable end-to-end Predictive Inverse Dynamics Model (PIDM) built on Transformers. Seer predicts actions conditioned on the robot's forecasted visual states, closing the loop between vision and action, and achieves state-of-the-art results in simulation and real-world manipulation tasks.
Contribution
The paper presents a novel predictive inverse dynamics model, Seer, trained end-to-end on large-scale robotic datasets, demonstrating better generalization and performance than previous methods.
Findings
Achieves 13% improvement on LIBERO-LONG benchmark
Sets new state-of-the-art on CALVIN ABC-D with 4.28 average length
Outperforms previous methods in real-world manipulation tasks
Abstract
Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representations or generative models, also referred to as world models, using large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre-trained on large-scale robotic datasets, such as DROID, and can be adapted to…
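The PIDM idea described in the abstract (forecast a future state, then recover the action that leads to it) can be illustrated with a toy sketch. This is not Seer's implementation: Seer uses Transformers over images, whereas the sketch below assumes a 2D point state, additive dynamics, a hand-written forecaster, and a hypothetical target state `goal`. All function names and the `MAX_STEP` constant are made up for illustration.

```python
from typing import List

State = List[float]   # toy stand-in for a visual state (assumption: Seer uses images)
Action = List[float]  # toy stand-in for a robot action

MAX_STEP = 0.5  # hypothetical per-step motion limit, chosen for illustration


def forecast_state(state: State, goal: State) -> State:
    """Toy foresight module: predict the next state, moving toward the goal.

    Stands in for a learned model that forecasts future visual states.
    """
    predicted = []
    for s, g in zip(state, goal):
        # Clamp the per-coordinate move to [-MAX_STEP, MAX_STEP].
        delta = max(-MAX_STEP, min(MAX_STEP, g - s))
        predicted.append(s + delta)
    return predicted


def inverse_dynamics(state: State, next_state: State) -> Action:
    """Toy inverse dynamics: the action that maps state to next_state.

    Assumes additive dynamics, next_state = state + action.
    """
    return [n - s for s, n in zip(state, next_state)]


def pidm_policy(state: State, goal: State) -> Action:
    """Predict an action conditioned on the forecasted next state."""
    return inverse_dynamics(state, forecast_state(state, goal))


def rollout(state: State, goal: State, steps: int) -> State:
    """Apply the policy for a number of steps under the toy additive dynamics."""
    for _ in range(steps):
        action = pidm_policy(state, goal)
        state = [s + a for s, a in zip(state, action)]
    return state


if __name__ == "__main__":
    print(pidm_policy([0.0, 0.0], [1.0, -2.0]))
    print(rollout([0.0, 0.0], [1.0, -2.0], steps=5))
```

In this sketch the two modules are separate functions composed at inference time. The abstract's point is that, in the end-to-end setting, forecasting and inverse dynamics are trained together rather than as isolated stages.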
Taxonomy
Topics: Robot Manipulation and Learning · Robotic Mechanisms and Dynamics · Teleoperation and Haptic Systems
