Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models

Shen Tan; Dong Zhou; Xiangyu Shao; Junqiao Wang; Guanghui Sun

arXiv:2507.17379 · cs.RO · July 24, 2025

Open Access

TL;DR

This paper introduces LOVMM, a framework that combines large language and vision-language models to enable robots to perform open-vocabulary mobile manipulation tasks in household environments using natural language commands, demonstrating strong zero-shot and multi-task capabilities.

Contribution

The novel LOVMM framework integrates LLMs and VLMs for open-vocabulary mobile manipulation, enabling robots to understand and execute complex natural language instructions in household settings.

Findings

01 Strong zero-shot generalization in household environments

02 Effective multi-task learning capabilities

03 Higher success rates than state-of-the-art methods

Abstract

Open-vocabulary mobile manipulation (OVMM), which involves handling novel and unseen objects across different workspaces, remains a significant challenge for real-world robotic applications. In this paper, we propose a novel Language-conditioned Open-Vocabulary Mobile Manipulation framework, named LOVMM, which incorporates a large language model (LLM) and a vision-language model (VLM) to tackle various mobile manipulation tasks in household environments. Our approach can solve various OVMM tasks specified by free-form natural language instructions (e.g., "toss the food boxes on the office room desk to the trash bin in the corner" and "pack the bottles from the bed to the box in the guestroom"). Extensive experiments in simulated complex household environments show that LOVMM has strong zero-shot generalization and multi-task learning abilities. Moreover, our approach can also generalize to…

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multi-Agent Systems and Negotiation