UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

Jianke Zhang; Yanjiang Guo; Yucheng Hu; Xiaoyu Chen; Xiang Zhu; Jianyu Chen

arXiv:2501.18867 · cs.CV · June 27, 2025

TL;DR

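UP-VLA is a unified Vision-Language-Action (VLA) model trained with both multi-modal understanding and future prediction objectives. It targets high-level semantic comprehension together with low-level spatial understanding, and reports a 33% improvement on the Calvin ABC-D benchmark along with higher success rates in real-world manipulation tasks.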

Contribution

The paper introduces UP-VLA, a novel training paradigm that jointly optimizes understanding and prediction for embodied agents, addressing limitations of existing VLA models.

Findings

1. 33% improvement on the Calvin ABC-D benchmark
2. Enhanced success in real-world spatial manipulation tasks
3. Better low-level spatial understanding in embodied control

Abstract

Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and…

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: Focus