Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

Yi Zhang; Che Liu; Xiancong Ren; Hanchu Ni; Yingji Zhang; Shuai Zhang; Zeyuan Ding; Jiayu Hu; Haozhe Shan; Junbo Qi; Yan Bai; Dengjie Li; Jiachen Luo; Yidong Wang; Yong Dai; Zenglin Xu; Bin Shen; Qifan Wang; Jian Tang; Xiaozhu Ju

arXiv:2511.16602·cs.AI·November 21, 2025

Open Access

TL;DR

This paper introduces DPPO, a training framework that combines supervised fine-tuning and reinforcement learning to improve embodied intelligence models efficiently from limited data.

Contribution

The paper presents DPPO, a metacognitive training framework that dynamically balances competence expansion (supervised fine-tuning) and skill refinement (reinforcement learning) for embodied AI.

Findings

1. Pelican-VL 1.0 improved by 20.3% when trained with DPPO.

2. Pelican-VL 1.0 outperforms open-source models at the 100B-parameter scale by 10.6%.

3. First systematic framework to address data and resource limitations in embodied AI.

Abstract

Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3%…
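The alternation the abstract describes (reinforcement learning to refine skills and expose weaknesses, then supervised fine-tuning targeted at those weaknesses) can be illustrated with a toy simulation. This is only a sketch under loud assumptions, not the authors' implementation: each task is reduced to a single success probability, and the names `rl_phase`, `find_weaknesses`, `sft_phase`, `metaloop`, and the parameters `lr`, `boost`, and `threshold` are all hypothetical stand-ins chosen for illustration.

```python
import random


def rl_phase(competence, rollouts, rng, lr=0.05):
    """Skill refinement (toy): attempt each task and record its success rate."""
    success_rate = {}
    for task, p in competence.items():
        wins = sum(rng.random() < p for _ in range(rollouts))
        success_rate[task] = wins / rollouts
        # Hypothetical update rule: the gain scales with the observed success rate.
        competence[task] = min(1.0, p + lr * success_rate[task])
    return success_rate


def find_weaknesses(success_rate, threshold):
    """Weakness identification (toy): tasks whose success rate is below threshold."""
    return sorted(t for t, r in success_rate.items() if r < threshold)


def sft_phase(competence, weak_tasks, boost=0.2):
    """Competence expansion (toy): spend supervised effort only on weak tasks."""
    for task in weak_tasks:
        competence[task] = min(1.0, competence[task] + boost)


def metaloop(competence, cycles=3, rollouts=20, threshold=0.5, seed=0):
    """Alternate the RL and SFT phases for a fixed number of cycles."""
    rng = random.Random(seed)
    competence = dict(competence)  # work on a copy; the caller's dict is untouched
    history = []
    for _ in range(cycles):
        rates = rl_phase(competence, rollouts, rng)
        weak = find_weaknesses(rates, threshold)
        sft_phase(competence, weak)
        history.append(weak)
    return competence, history


if __name__ == "__main__":
    final, history = metaloop({"grasp": 1.0, "stack": 0.0}, cycles=3)
    print(final)
    print(history)
```

In this toy, the targeted resource allocation shows up as `sft_phase` touching only the tasks flagged by `find_weaknesses`, so a task the policy already solves receives no supervised effort.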

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Artificial Intelligence in Games