SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Yuyuan Yang; Junkun Hong; Hongrong Wang; Honghao Cai; Xunpeng Ren; Ge Wang; Mingcong Lei; Shenhao Yan; Jiahao Yang; Chengsi Yao; Xi Li; Yiming Zhao; Yatong Han; Jinke Ren

arXiv:2603.11563 · cs.CV · March 13, 2026

TL;DR

SVLL is a staged training framework that uses Bias-DPO to improve physically grounded embodied task planning, targeting safer, more reliable action sequences from vision-language models.

Contribution

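The paper proposes SVLL, a three-stage training approach with Bias-DPO, to improve temporal reasoning and safety in embodied planning tasks. It targets two limitations of existing methods: premature temporal binding in joint end-to-end training and optimization instability in standard reinforcement learning.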

Findings

01 SVLL outperforms state-of-the-art models on the AI2-THOR benchmark.

02 SVLL significantly reduces physical constraint violations.

03 SVLL achieves higher task success rates in real-world robotic deployments.

Abstract

Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO), its purely relative nature: optimizing only the preference gap between winning and losing trajectories while…
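
The "purely relative nature" of standard DPO mentioned in the abstract can be illustrated with a minimal sketch. It implements the commonly used DPO loss for a single preference pair, not the paper's Bias-DPO, whose form is not given in the truncated abstract. The function name `dpo_loss`, the value of `beta`, and the example log-probabilities are all assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) trajectory pair.

    All arguments are sequence log-probabilities; `beta` is a
    hypothetical temperature chosen for this example.
    """
    # Implicit reward of each trajectory: policy log-prob minus the
    # frozen reference log-prob. Only their difference enters the loss.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative values only (not from the paper).
base = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)

# Lower BOTH policy log-probs by 20 nats: the winning trajectory is now far
# less likely in absolute terms, yet the preference gap is unchanged.
shifted = dpo_loss(logp_w=-25.0, logp_l=-29.0, ref_logp_w=-6.0, ref_logp_l=-8.0)

print(math.isclose(base, shifted))
```

Because the loss depends only on the gap between the two implicit rewards, it cannot distinguish these two cases, even though the absolute likelihood of the preferred trajectory differs greatly between them.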

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Robot Manipulation and Learning